TO ALL WHOM IT MAY CONCERN: 

Be it known that WE, ANDREY RZHETSKY and SERGEY 
KALACHIKOV, citizens of Russia, whose post office addresses are 560 Riverside Drive, 
11F, New York, New York 10027; and 154 Haven Avenue, 1303, New York, New York 
10032, respectively; MICHAEL O. KRAUTHAMMER, citizen of Switzerland, whose 
post office address is 27 W. 76th Street, Apt. 3A, New York, N.Y., 10023; and CAROL 
FRIEDMAN and PAULINE KRA, citizens of the United States, whose post office 
addresses are 14 Dimitri Place, Larchmont, New York, 10538 and 109-14 Ascan Ave., 
Forest Hills, N.Y., 11375, respectively, have invented an improvement in 

GENE DISCOVERY THROUGH COMPARISONS OF NETWORKS 
OF STRUCTURAL AND FUNCTIONAL RELATIONSHIPS 
AMONG KNOWN GENES AND PROTEINS 
of which the following is a 



SPECIFICATION 



The invention described herein was funded in part by a grant from the 
National Library of Medicine, namely, Grant Numbers LM06274 and LM05627. The 
United States Government may have certain rights to the invention. The present 
specification contains a computer program listing which appears as a microfiche 
Appendix H. 

STATEMENT REGARDING MATERIAL SUBJECT TO COPYRIGHT 
A portion of the disclosure of this patent document contains material 
which is subject to copyright protection. The copyright owner has no objection to the 
facsimile reproduction by anyone of any portion of the patent document, as it appears in 
any patent granted from the present application or in the Patent and Trademark Office file 
or records available to the public, but otherwise reserves all copyright rights whatsoever. 

An appendix containing a source code listing utilized in practicing an 
exemplary embodiment of the invention is included as part of the Specification. 

1. INTRODUCTION 
The present invention relates to methods for identifying novel genes 
comprising: (i) generating one or more specialized databases containing information on 
gene/protein structure, function and/or regulatory interactions; and (ii) searching the 
specialized databases for homology or for a particular motif and thereby identifying a 
putative novel gene of interest. The invention may further comprise performing 
simulation and hypothesis testing to identify or confirm that the putative gene is a novel 
gene of interest. 

The present invention relates to natural language processing and extraction 
of relational information associated with genes and proteins that are found in genomics 
journal articles. To enable access to information in textual form, the natural language 
processing system of the present invention provides a method for extracting and 
structuring information found in the literature in a form appropriate for subsequent 
applications. Specifically, the present invention provides for the generation of 
specialized databases containing information on gene/protein structure, function and 
regulatory interactions based on the retrieval of such information from research articles 
and databases, and computer representation of such information in a manner that allows 
efficient access to the extracted information. 

The invention further provides for the use of the specialized databases for 
identifying novel genes based on detection of sequence similarities and domain/motif 
matches between genes/proteins, computation and interpretation of phylogenetic trees for 
multigene families, and analysis of homologous regulatory networks. The methods of the 
invention are based on the observation that functionally similar regulatory systems are 
generated during evolution by genetic duplication of ancestral genes. Thus, a comparison 
of homologous/similar networks within the same organism and between different species 
will allow the identification of genes absent in one of the systems under comparison. In 
this way genes that contribute to the phenotype of a specific disease associated with a 
particular biological system under analysis may be identified. 

2. BACKGROUND OF THE INVENTION 

2.1. NATURAL LANGUAGE PROCESSING 
Researchers working in molecular biology must constantly consider the 
information present in the literature relating to their regulatory systems of interest and the 
genes and proteins that operate within those systems. Unfortunately, to remain up-to-date 
on the relevant literature, the researcher is required to perform laborious reading and 
manual integration of research articles, each of which may address a narrow subject. 
Therefore, technology that enables rapid retrieval of information from literature and 
manipulation of derived functional data should have a dramatic effect on the access of 
the researcher to important facts and ultimately should facilitate the discovery of novel 
human genes. 

Natural language processing provides a complex of programs for automatic 
retrieval of information through text analysis and for the 
computer representation of that information in a form that allows efficient access and 
extraction of that information. MedLEE (Medical Language Extraction and Encoding 
System) has recently been successfully used for processing different types of medical 
texts as described in co-pending United States Patent Application Serial Number 
09/370,329, incorporated herein in its entirety by reference (see also, Friedman et al., 
1994, J. Amer. Med. Inf. Assoc. 1:161-174; Hripcsak et al., 1995, Ann. Intern. Med. 
122:681-688; Hripcsak et al., 1998, Meth. Inform. Med.; Jain et al., 1996, Proc. AMIA 
Annu. Fall Symp. 542-546; Knirsch et al., 1998). When tested, MedLEE was on average 
as successful in retrieving reports associated with specified clinical connections as twelve 
medical experts invited for evaluation of the system. 

Another text analysis technique has recently been developed that combines 
finite-state machines with statistical machine learning approaches. These models extract 
detailed semantic information from texts (e.g., see Hatzivassiloglou 1996, In Klavens, 
J.L., and Resnick, P.S. (eds) The Balancing Act: Combining Symbolic and Statistical 
Approaches to Language, MIT Press, Cambridge, MA) when extensive prior knowledge 
about the domain is not available. The techniques have been subsequently applied to the 
tasks of (i) automatically identifying medical terms for the automated summarization of 
research articles reporting on clinical studies and (ii) sanitizing sensitive information in 
patient records so that they can be widely disseminated for research purposes. 

A number of projects have also been developed as statistical information 
extraction tools that operate with limited or no prior knowledge about the application 
domain. These earlier efforts include XTRACT, a tool that recovers collocational 
restrictions between words that has been licensed to more than thirty sites worldwide 
(Smadja, F., 1993, J. Comp. Ling. 19:143-177), CHAMPOLLION, a system that retrieves 
bilingual mappings between words and phrases in parallel texts from different languages 
(Smadja, F. et al., 1996, J. Computational Linguistics 22:1-38), and a system that 
automatically aligns noisy, semi-parallel texts from different languages (Fung, P. and 
McKeown, K.R., 1997, Machine Translation 11:23-29). 



2.2. IDENTIFICATION OF NOVEL GENES 
A variety of different methods are currently utilized for the identification 
and characterization of novel genes. Perhaps the most widely used method for generating 
large quantities of sequence information is via high throughput nucleotide sequencing of 
random DNA fragments. A disadvantage associated with this gene discovery technique 
is that in most instances when genes are identified their function is unknown. 

For identification of specific disease genes, positional cloning is currently 
the most widely used method. The positional cloning approach combines methods of 
formal genetics, physical mapping and mutation analysis and usually starts with a precise 
description of the disease phenotype and a tracing of the disease through families of 
affected individuals. Genetic linkage data obtained from the analysis of affected families 
frequently allow the determination of an approximate genomic localization of the 
candidate disease gene with a precision of several million nucleotides. Once 
localized, the genetically defined chromosomal region is then recovered from genomic 
libraries as a contiguous set of genomic fragments. Genes residing in the disease-related 
region are determined by analysis of transcripts that are transcribed from the genomic 
fragment. From this analysis an initial set of candidate genes for a particular disease is 
identified based on the presence of the gene product in the biological system affected by 
disease and a correlation between its expression pattern and the pattern of disease 
progression. 

Important information for selection of candidate genes also comes from 
analysis of their homology with genes known to be part of the same or related biological 
system. Finally, the ultimate proof of association between a gene and a genetic disorder 
comes from mutational analysis of a gene in patients affected by the disorder and from 
demonstration of a statistical correlation between occurrence of mutation and the disease 
phenotype. 

Although positional cloning is a powerful method for gene discovery, the 
experimental method is extremely tedious and expensive. Moreover, disease genes 
implicated in genetically complex disorders, i.e., those controlled by multiple loci, can 
hardly be found using this strategy because of the complications associated with multiple 
loci linkage analysis. 

Specialized databases for homology searches have also been utilized in 
disease gene discovery projects. In recent years a number of efficient sequence 
comparison tools have been developed such as the BLAST (Basic Local Alignment 
Search Tool) family of programs designed for comparison of a single "search sequence" 
with a database (see Altschul et al., 1990, J. Mol. Biol. 215:403-410; Altschul et al., 
1997, Nucleic Acids Res. 25:3389-3402), the family of Hidden Markov Model methods 
for comparison of a set of aligned sequences that usually represent a protein motif or 
domain with a database (e.g., Krogh et al., 1994, J. Mol. Biol. 235:1501-1531; Grundy et 
al., 1997, Biochem. Biophys. Res. Commun. 231:760-6) and various other comparison 
tools (Wu et al., 1996, Comput. Appl. Biosci. 12:109-118; Neuwald et al., 1995, Protein 
Sci. 4:1618-1632; Neuwald, 1997, Nucleic Acids Res. 25:1665-1677). 

When used in disease gene discovery projects, homology searches can be 
enhanced by creating specialized databases that utilize statistical analysis for evaluating 
significance of sequence similarities in comparison of new sequences with a database of 
known sequences. Such statistical analysis is fine-tuned to the size of the database used (Altschul 
et al., 1990, J. Mol. Biol. 215:403-410; Altschul et al., 1997, Nucleic Acids Res. 25:3389- 
3402), so that the same level of homology between a search sequence and a database 
sequence can be determined to be highly significant if the search sequence is compared 
with a smaller database, or insignificant and thus undetectable, if the search sequence is 
compared with a larger database. 
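
By way of illustration only, the dependence of significance on database size can be sketched with the Karlin-Altschul expectation value used by BLAST, E = K·m·n·exp(−λS), where m is the query length, n the total database length and S the raw alignment score. The values of K and λ, the sequence lengths, the score and the 0.01 cutoff below are assumed for the example and do not come from this specification.

```python
import math

def expected_hits(score, query_len, db_len, k=0.13, lam=0.32):
    """Karlin-Altschul expectation: chance alignments scoring at least `score`.

    k and lam are illustrative placeholder values (assumptions); real values
    depend on the scoring matrix and gap penalties in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)

# Same alignment score and query, two hypothetical database sizes (in residues).
score, query_len = 65, 300
e_small = expected_hits(score, query_len, db_len=50_000)        # specialized database
e_large = expected_hits(score, query_len, db_len=500_000_000)   # comprehensive database

threshold = 0.01  # assumed significance cutoff
print(f"specialized:   E = {e_small:.3g}, significant: {e_small < threshold}")
print(f"comprehensive: E = {e_large:.3g}, significant: {e_large < threshold}")
```

Because E scales linearly with the database length, the identical alignment falls below the assumed cutoff in the small database and above it in the large one.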

In projects oriented towards gene discovery, researchers usually have some 
a priori knowledge about the set of 
genes/proteins that might display important similarity to the unknown new gene. 
Therefore, as an alternative to standard homology searches, selecting an a priori defined 
set of genes/proteins for comparison with new 
experimental sequences is a feasible and useful strategy. This strategy was successfully 
applied to search for homologs of disease genes in yeast and nematode genomes by 
Mushegian et al. (1997, Proc. Natl. Acad. Sci. USA 94:5831-5836). 

Two homologous genes taken from different species that originate from 
the nearest common ancestor by speciation are referred to as orthologs, while any two 
genes that originate from a common ancestor via a series of events involving 
intragenomic duplications are called paralogs. Tatusov et al. (1994, Proc. Natl. Acad. Sci. 
USA 91:12091-12095) describe comparisons of proteins encoded by the genomes of 
different phylogenetic lineages and elucidation of consistent patterns of sequence 
similarities permitting the delineation of clusters of orthologous groups (COGs). Each 
COG consists of individual orthologous genes or orthologous groups of paralogs from 
different phylogenetic lineages. Since orthologs typically have the same function, the 
classification of known genes and proteins into clusters of orthologous groups permits the 
assignment of a function to a newly discovered gene or protein by merely classifying it 
into a COG. Although Tatusov describes a method for assigning a function to a newly 
discovered gene, he does not describe a method for predicting the existence of 
undiscovered genes. In addition, Yuan et al. attempted simultaneous reconstruction of a 
species tree and identification of paralogous groups of sequences and detection of 
orthologs in sequence databases (Yuan et al., 1998, Bioinformatics 143:285-289). 

Other groups have aimed at capturing interactions among molecules 
through the use of programs designed to compare structures and functions of proteins 
(Kazic, 1994, In: Molecular Modeling: From Virtual Tools to Real Problems. 
Kumosinski, T. and Liebman, M.N. (Eds.), American Chemical Society, Washington, 
D.C. pp. 486-494; Kazic, 1994, In: New Data Challenges in Our Information Age. 
Glaesar, P.S. and Millward, M.T.L. (Eds.). Proceedings of the Thirteenth International 
CODATA Secretariat, Paris pp. C133-C140; Goto et al., 1997, Pac. Symp. Biocomput. p. 
175-186; Bono et al., 1998, Genome Res. 8:203-210; Selkov et al., 1996, Nucleic Acids 
Res. 24:26-28). These projects are significantly different from the inventive methods 
described herein because they do not describe methods for deducing the existence of as 
yet unknown genes based on comparisons of regulatory pathways and gene structure 
between one or more species. The present invention provides a method for increasing 
the sensitivity of analysis methods through the generation of specialized databases. 

3. SUMMARY OF THE INVENTION 
In accordance with the present invention there are provided methods for 
identification of novel genes comprising (i) generating one or more specialized databases 
containing information on gene/protein structure, function and/or regulatory interactions; 
and (ii) searching the specialized databases for homology or for a particular motif and 
thereby identifying a putative novel gene of interest. The invention may further comprise 
performing simulation and hypothesis testing to identify or confirm that the putative gene 
is a novel gene of interest. 

The invention is based, in part, on the observation that functionally similar 
regulatory systems are generated during evolution by genetic duplication of ancestral 
genes. Thus, by comparing phylogenetic trees or regulatory networks and identifying 
genes and/or proteins absent in one system under comparison, the existence of as yet 
unidentified genes and/or proteins can be predicted. To make meaningful comparisons of 
phylogenetic trees it is necessary to distinguish between orthologs and paralogs. The 
present invention provides a method useful for discriminating between orthologs and 
paralogs and inferring the existence of as yet unidentified genes and/or proteins. 

The present invention relates to natural language processing and extraction 
of relational information associated with genes and proteins that are found in genomics 
journal articles. Specifically, the natural language processing system of the invention is 
used to parse the articles published in biological journals focusing on structure and 
interactions among genes and proteins followed by computer representation of such 
interactions. 

In accordance with the present invention, specialized databases are 
developed that contain information on gene/protein structure and interactions based on 
information derived from preexisting databases and/or research articles including 
information on interactions among genes and proteins, their domain/motif structure and 
their subcellular and tissue expression/distribution patterns. 

The invention relates to a sequence analysis program which utilizes the 
specialized database for comparison of a single sequence, processing the output into a 
sequence alignment, computing phylogenetic trees, and analyzing these trees to predict 
undiscovered genes. This program also includes a set of tools for generating 
motif/domain models from multiple sequence alignments of known genes and for using 
these models for extraction of structurally and/or functionally homologous sequences 
from databases which contain raw sequence data. 

The invention further provides for a simulation and hypothesis testing 
program which relies on the specialized databases of gene/protein interactions for 
identifying potentially undiscovered members of multigene families through comparisons 
of regulatory networks for different species and testing hypotheses with regard to 
regulatory cascades. A comparison of homologous regulatory networks within the same 
organism and between different species of organisms will allow the identification of 
genes absent in one of the systems under comparison, thus providing a set of candidate 
genes. In this way, genes that contribute to the phenotype of a specific disease associated 
with a particular biological system under analysis may be identified, mapped and 
subjected to mutational analysis and functional studies. 
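
By way of a minimal sketch, the comparison of homologous networks described above can be viewed as a set difference over ortholog groups: any group represented in one species but not in the other points to a candidate gene. The group labels, gene names and the function `missing_orthologs` are hypothetical and are introduced only for illustration; they are not taken from the appendices.

```python
def missing_orthologs(network_a, network_b):
    """Return genes present in network_a that have no counterpart in network_b.

    Each network maps an ortholog-group label to the gene known in that
    species for that group. Labels and gene names here are hypothetical.
    """
    return {group: gene for group, gene in network_a.items() if group not in network_b}

# Hypothetical homologous regulatory networks in two species.
species_a = {"group1": "geneA1", "group2": "geneA2", "group3": "geneA3"}
species_b = {"group1": "geneB1", "group3": "geneB3"}

# Groups represented in species A but not yet known in species B: each one
# suggests a candidate gene to search for in species B.
candidates = missing_orthologs(species_a, species_b)
print(candidates)
```

A practical system would also have to weigh the possibility that the gene was genuinely lost in one lineage, which this sketch does not attempt.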

4. BRIEF DESCRIPTION OF THE DRAWINGS 
Figure 1 is a block diagram illustrating the three major programs of the 
method according to the present invention: (i) the generation of specialized databases 
based on information on gene/protein structure, function and regulatory interactions 
derived from research papers and databases; (ii) sequence analysis; and (iii) simulation 
and hypothesis testing; 

Figure 2 is a block diagram of an information extraction system in 
accordance with a preferred embodiment of the present invention; 

Figure 3 is a diagram illustrating the object representation of molecules 
and relations between them; 

Figure 4 shows a set of keywords defining proteins involved in apoptosis 
pathways, these keywords having been utilized for generating a specialized sequence 
database Apoptosis3, this list having been compiled manually for testing the concept of 
specialized databases; 

Figure 5 shows a "species tree," which is a graph depicting the correct 
order of speciation events leading to a set of present day species; a "gene tree," which is a 
graph depicting a history of a few genes from the same species, where each species can 
be represented by multiple paralogous genes (because the set of known genes is 
incomplete for most genomes, and there are often multiple representations of the same 
gene family in the same genome, the gene tree can be drastically different from the 
corresponding species tree); and a "reconciled tree", which is the gene tree that would be 
obtained if gene deletions were completely forbidden and all genes were known for all 
species under analysis; 

Figure 6 shows the original tree of ALDH sequences, indicating sequence 
clusters where bacterial, plant, fungal and nematode orthologous genes are present, but a 
human ortholog was not yet known; 

Figure 7 shows the same phylogenetic tree as in Figure 6 with an 
additional human protein, referred to as antiquitin, which was discovered by the method of 
the invention; 

Figure 8 is a schematic diagram illustrating functional network-based gene 
discovery in accordance with the present invention; 

Figure 9A presents diagrams depicting the regulatory relationships among 
hypothetical proteins (denoted with Arabic numerals) of hypothetical species A and B. 
Proteins in different species denoted with the same numeral are considered orthologous. 
The diagrams show that regulatory relationships between a pair of proteins can be of 
three different kinds; 

Figures 9B, 9C, and 9D are diagrams representing Boolean operations OR, 
AND, and XOR, on arcs of the two oriented graphs of Figure 9A, the same operations 
being applicable to the set of vertices of the two oriented graphs; 

Figure 10 is a diagram representing a hypothetical example of defining 
homologous protein networks in two different species using protein motifs, the diagram 
showing only two hypothetical proteins (1 and 2) for species A and three hypothetical 
proteins (1, 3, and 4) for species B. Protein 1 in both species has motifs α and β, protein 
2 has motifs δ, ε, and ζ, and proteins 3 and 4 have motifs δ and ζ, and ε, respectively. 
The motif analysis can indicate that proteins 3 and 4 in species B may collectively 
perform the same function as protein 2 in species A; 

Figures 11A and 11B are diagrams respectively representing hypothetical 
examples of evaluating the impact of a "knockout" of hypothetical gene A on the 
expression of a hypothetical gene B. The effect of knockout of gene A calculated by 
multiplication along the shortest pathway connecting genes A and B is inhibition of 
gene B, the resulting effect being zero if the orientation of only one arc in the same 
pathway is reversed; 

Figure 12 is a flow chart representing the scheme of gene discovery 
analysis involving motif/domain analysis in accordance with the present invention; and 

Figure 13. Identification of genes in C. elegans containing either POZ or 
kelch domains. The protein accession numbers are indicated adjacent to the different 
protein domains. The protein corresponding to accession number gi/1132541 contains a 
POZ domain, death domain, kinase domain and heat repeat. 

Figure 14A. Two human sequences with the closest homology to the C. 
elegans sequence gi/1132541. 

Figure 14B. Computed gene tree indicating that the identified human gene 
represents an ortholog of the C. elegans gene gi/1132541. 

Figure 14C. Nucleotide sequence of the death domain gene. 

Figure 14D. Deduced amino acid sequence of the death domain protein. 

Figure 15. Identification of a candidate gene implicated in the etiology of 
Chronic Lymphocytic Leukemia (CLL). Sequence homology between a CLL region 
open reading frame and mouse Rpt1 (sp/P15533/RPT1) is presented. 

Figure 16A-B. Model of regulatory functions of Rpt1. Figure 16A 
indicates that in mouse T lymphocytes Rpt1 serves as a repressor of the gene for 
interleukin 2 receptor (IL-2R). Figure 16B demonstrates that when Rpt1 is knocked out, 
the regulatory effect is manifested as a block of the apoptotic pathway for T-lymphocytes 
resulting in accumulation of T-lymphocytes in blood. 

Figure 17A. Two EST sequences identified by searching a protein dbEST 
using the mouse Mad3 protein as a query. 

Figure 17B. Nucleotide sequence of the human Mad3 gene. 

Figure 17C. Complete sequence of the human Mad3 protein. A search was 
conducted to identify overlapping sequences. The complete sequence of the gene was 
assembled and the amino acid sequence deduced. The translated human Mad3 sequence 
consists of 206 amino acid residues, 81% of which are identical to the mouse Mad3 
protein. 

Figure 17D. Multiple alignment of the human Mad3 amino acid sequence 
with known Mad proteins. 

Figure 18A. Phylogenetic tree indicating relationship between three 
known mouse Mad genes and their two human homologs. 

Figure 18B. Phylogenetic tree including new human Mad3 sequence. The 
phylogenetic tree indicates that the new human gene belongs to the family of Mad 
proteins and is an ortholog of mouse Mad3. 



5. DETAILED DESCRIPTION OF THE INVENTION 
The present invention provides methods for identification of novel genes 
comprising: (i) generating specialized databases containing information on gene/protein 
structure, function and regulatory interactions; and (ii) sequence analysis, which includes 
homology searches and motif analysis, thereby identifying a putative novel gene of 
interest. The invention may further comprise performing simulation and hypothesis 
testing to identify or confirm that the putative gene is a novel gene of interest. 

The specialized databases are constructed utilizing information concerning 
gene/protein structure or function derived from unpublished data, research articles and/or 
existing databases. The specialized databases can be used to identify novel genes by: 
(i) searching for motif/domain combinations characteristic for a putative gene of interest; 
(ii) phylogenetic tree analysis of homologous genes for predicting the existence of yet 
undiscovered genes; (iii) comparing members of interactive gene/protein networks from 
different species for predicting the existence of yet undiscovered genes; and (iv) testing a 
hypothesis with regard to known interactions of homologs from other species in 
regulatory pathways. 

5.1. THE NATURAL LANGUAGE PROCESSING 
The present invention relates to a natural language processing system that 
is designed to parse the electronic versions of articles published in journals that report on 
structural interactions among genes and proteins. The system provides a method for 
extracting information on interactions among genes and proteins, their domain/motif 
structure, and/or their sub-cellular and tissue expression/distribution patterns, followed by 
computer representation of such information. 

The general natural language processing system of the invention is 
schematically depicted in Figure 2. The collection phase automatically collects articles 
from appropriate literature, and selects articles that contain relevant information using 
keyword search techniques. In the next phase, the preprocessor standardizes the selected 
articles so that they consist of tagged ASCII text where the tags delineate critical 
components of the article. The next phase, termed the extraction phase, retrieves and 
classifies biological entities, i.e., names of proteins, genes and small molecules. In 
addition, the relationship extraction phase recovers structural relationships between the 
entities. This phase is followed by a phase which performs an analysis of the sequence of 
events. 

The final phase of the system processes the output extracted from an 
article to remove redundancies and inconsistencies and to incorporate implicit information 
before the extracted knowledge, consisting of biological entities, their attributes, 
conditional constraints, and relationships between them, is added for subsequent use in analysis 
and hypothesis testing. The information extraction system as depicted in Figure 2, 
referred to herein as "GENIE," is designed for use as a general processor within the 
domain of genomics literature although the system may also be used in other specialized 
domains. GENIE is an adaptation of MedLEE, which was developed for the medical domain. 
GENIE uses the same source code as MedLEE but the lexicons and grammar were 
adapted for genomics literature. 

The information extraction system of the present invention is described 
below, by way of example, with reference to the genomics domain uses of GENIE. It is 
written in Quintus Prolog and runs under the Unix or Windows operating systems, as described 
in detail below. 

A natural-language phrase included in a text document is understood as a 
delimited string comprising natural-language terms or words. The string is computer 
readable as obtained, e.g., from a pre-existing database, a keyboard input, optical 
scanning of typed or handwritten text, or processed voice input. The delimiter may be a 
period, a semicolon, an end-of-message signal, a new-paragraph signal, or any other 
suitable symbol recognizable for this purpose. Within the phrase, the terms may be 
separated by another type of delimiter such as a blank or another suitable symbol. 
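
The two levels of delimiting described above may be sketched as follows. This is an illustrative Python fragment, not the Quintus Prolog code of the appendix; the delimiter patterns and the function name `split_phrases` are assumptions chosen for the example, and a period-based split of this kind would mishandle abbreviations such as "e.g." without further rules.

```python
import re

# Assumed phrase delimiters: period, semicolon, or a new-paragraph signal.
PHRASE_DELIMITERS = r"[.;]|\n\s*\n"
# Assumed term delimiter: one or more blanks.
TERM_DELIMITER = r"\s+"

def split_phrases(text):
    """Split text into delimited phrases, each returned as a list of terms."""
    phrases = []
    for chunk in re.split(PHRASE_DELIMITERS, text):
        terms = [t for t in re.split(TERM_DELIMITER, chunk.strip()) if t]
        if terms:  # skip empty chunks left by trailing delimiters
            phrases.append(terms)
    return phrases

sample = "Rap inhibited fyn; jnk activates bad."
phrases = split_phrases(sample)
print(phrases)
```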

As a result of phrase parsing, terms in a natural-language phrase are 
classified (e.g., as referring to a gene, a protein, or their interactions) and the 
relationships between the interactions are established and represented in a standard form. 
For example, in the sentence "Rap inhibited fyn", the structured form would be: 
[action,inactivate,[protein,rap],[protein,fyn]]. 
In such an example, the interaction is "inactivate", the agent is "Rap" and the target is 
"fyn." More complex sentences consisting of nested relationships, such as "The 
activation of BAD was suppressed by the phosphorylation of JNK" can also be parsed 
and represented appropriately. The structured output form for this sentence would be: 
[action,inactivate,[action,phosphorylate,x,[protein,jnk]],[action,activate,x,[protein,bad]]] 
In this second example, the primary interaction is again "inactivate"; its agent is itself an 
interaction, "phosphorylate", whose target is the protein "jnk" (the agent of 
"phosphorylate" is not specified and thus is represented as "x"). The 
target of "inactivate" is also an interaction, "activate", whose target is the protein "bad" 
and whose agent is unknown. 
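
For illustration, the two structured forms above can be held as nested lists and traversed recursively. The fragment below is a sketch in Python rather than the Prolog terms actually produced by the system; the helper names `render` and `proteins` are hypothetical, and the bracket placement in the nested form follows the reading given in the preceding paragraph.

```python
def render(form):
    """Render a nested structured form in the bracketed notation used above."""
    if isinstance(form, list):
        return "[" + ",".join(render(part) for part in form) + "]"
    return str(form)

def proteins(form):
    """Collect protein names from a (possibly nested) structured form."""
    if not isinstance(form, list):
        return []
    if form and form[0] == "protein":
        return [form[1]]
    found = []
    for part in form[1:]:
        found.extend(proteins(part))
    return found

# "Rap inhibited fyn"
simple = ["action", "inactivate", ["protein", "rap"], ["protein", "fyn"]]
# "The activation of BAD was suppressed by the phosphorylation of JNK"
nested = ["action", "inactivate",
          ["action", "phosphorylate", "x", ["protein", "jnk"]],
          ["action", "activate", "x", ["protein", "bad"]]]

print(render(simple))
print(proteins(nested))
```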

While parsing is based on both syntactic and semantic grammatical 
patterns, the substances in a domain are normally only semantic categories such as 
"protein", "gene", and "small molecule." There are no corresponding syntactic 
categories needed for these substances because they are normally all nouns. However, 
each action can be categorized both semantically and syntactically. An action, which is a 
semantic category, can generally occur syntactically as a verb "inactivate" or as a noun 
"inactivation." Therefore there are two sets of lexical entries for the actions: syntactic 
and semantic. The syntactic lexicon for actions specifies the main syntactic category 
such as "v" for verb, "ving" for progressive form of verb, and "activation" for noun. 
The semantic entries for actions not only categorize the actions, but also specify features 
for each action. For example, one feature provides the number of arguments that are 
expected for the action, i.e., some actions are associated with two arguments because they 
have an agent and a target, such as "inactivate", and others have just an agent, such as "mutate." The 
lexicon of substances and structures appears as Appendix A; the syntactic lexicon for 
actions appears as Appendix B; and the semantic lexicon of actions appears as 
Appendix C. 
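
A semantic lexicon entry carrying per-action features can be sketched as below. The fragment covers the argument-count feature just described together with the argument-reversal feature discussed in the paragraph that follows. The class `ActionEntry`, the function `target_form` and the four entries are hypothetical illustrations in Python; they are not the contents of Appendix C.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionEntry:
    action: str    # underlying semantic action
    arity: int     # expected arguments: 1 = agent only, 2 = agent and target
    reverse: bool  # True if the surface argument order must be swapped

# Hypothetical semantic lexicon entries, keyed by surface phrase.
SEMANTIC_LEXICON = {
    "inactivate": ActionEntry("inactivate", 2, False),
    "activates": ActionEntry("activate", 2, False),
    "attributable to": ActionEntry("cause", 2, True),
    "mutate": ActionEntry("mutate", 1, False),
}

def target_form(phrase, first, second=None):
    """Build [action, name, agent, target] from arguments in surface order."""
    entry = SEMANTIC_LEXICON[phrase]
    if entry.arity == 1:
        return ["action", entry.action, first]
    agent, target = (second, first) if entry.reverse else (first, second)
    return ["action", entry.action, agent, target]

print(target_form("attributable to", "phosphorylation of jnk", "activation of bad"))
print(target_form("activates", "jnk", "bad"))
```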

A second feature specifies whether or not the arguments should be 
reversed when obtaining the target form. For example, the arguments of "attributable to" 
should be reversed (i.e., in "the phosphorylation of jnk is attributable to the activation of 
bad", the underlying action is "cause" (from "attributable to"), the agent is "the activation 
of bad" and the target is "the phosphorylation of jnk"), whereas the arguments of 
"activates" are not (i.e., in "jnk activates bad", the agent is "jnk" and the target is "bad"). 

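The argument-reversal feature can be illustrated with a minimal sketch. The lexicon entries, the tuple layout and the function name target_form are hypothetical and are not taken from Appendix C; only the two example phrases and their agent/target assignments come from the text above.

```python
# Hypothetical semantic lexicon: surface phrase -> (action, number of arguments, reverse flag).
# The layout is illustrative only; the actual lexicon is the one given in Appendix C.
SEMANTIC_LEXICON = {
    "activates": ("activate", 2, False),
    "attributable to": ("cause", 2, True),
}


def target_form(left, phrase, right):
    """Build [action, name, agent, target] for the pattern 'left phrase right'."""
    action, _arity, reverse = SEMANTIC_LEXICON[phrase]
    # When the reverse flag is set, the right-hand argument is the agent.
    agent, target = (right, left) if reverse else (left, right)
    return ["action", action, agent, target]


phosphorylation = ["action", "phosphorylate", "x", ["protein", "jnk"]]
activation = ["action", "activate", "x", ["protein", "bad"]]

print(target_form(["protein", "jnk"], "activates", ["protein", "bad"]))
print(target_form(phosphorylation, "attributable to", activation))
```

In the second call the activation of bad becomes the agent of "cause" although it appears on the right-hand side of the sentence.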
Figure 2 shows a preprocessor module of GENIE by which natural-language
input text is received. The preprocessor performs lexical lookup to identify
and categorize multi-word and single-word phrases within each sentence. The output of
this component consists of a list of word elements where each element is associated with
a word or multi-word phrase in the report. For example, assuming that the sentence "bad
functions as a negative regulator of the activation of jnk" is at the beginning of the
report, it would be represented as a list of elements where each element is a word or
phrase. For example, element 1 is associated with "bad", element 2 with the multi-word
phrase "functions as a negative regulator of", element 8 with "the", and element 9 with
"activation". The remainder of the list of word positions would be associated with the
remaining words in the report. Some of the phrases may not need lexical lookup because
they already have been tagged by a previous component. Such a tagging system is
described below in Section 5.2.
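The lexical lookup described above can be sketched as a greedy longest-match over a phrase lexicon. This is an illustration only: the lexicon contents, the category labels and the function name preprocess are assumptions, and the actual lexicons are those of Appendices A through C.

```python
# Hypothetical lexicon keyed by tuples of words; categories are illustrative.
LEXICON = {
    ("bad",): "protein",
    ("jnk",): "protein",
    ("functions", "as", "a", "negative", "regulator", "of"): "action",
    ("activation",): "action",
}
MAX_PHRASE_LEN = max(len(key) for key in LEXICON)


def preprocess(sentence):
    """Return (position, phrase, category) elements, preferring the longest phrase."""
    words = sentence.lower().split()
    elements = []
    i = 0
    while i < len(words):
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in LEXICON:
                # Positions are 1-based word positions, as in the example above.
                elements.append((i + 1, " ".join(key), LEXICON[key]))
                i += n
                break
        else:
            # Word not found in the lexicon: keep it with no category.
            elements.append((i + 1, words[i], None))
            i += 1
    return elements


for element in preprocess("bad functions as a negative regulator of the activation of jnk"):
    print(element)
```

With this toy lexicon the multi-word phrase occupies word positions 2 through 7, so "the" is element 8 and "activation" is element 9, in agreement with the numbering given in the text.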

The second component of the GENIE system is the parser. It utilizes the
grammar and categories assigned to the phrases of a sentence to recognize well-formed
syntactic and semantic patterns in the sentence and to generate structured output forms.
The parser proceeds by starting at the beginning of the sentence element list and
following the grammar rules. When a semantic or syntactic category is reached in the
grammar, the lexical item corresponding to the next available unmatched element is
obtained and its corresponding lexical definition is checked to see whether or not it
matches the grammar category. If it does match, the word or phrase is removed from the
unmatched sentence list, and the parsing proceeds. If a match is not obtained, an
alternative grammar rule is tried. If no analysis can be obtained, an error recovery
procedure is followed so that a partial analysis is attempted. The actual grammar used for
GENIE appears as Appendix D.

The parser module of GENIE uses the lexicon and a grammar module to
generate target forms. In addition to parsing of complete phrases, subphrase
parsing can be used to advantage where highest accuracy is not required. In case a
phrase cannot be parsed in its entirety, one or several attempts can be made to parse a
portion of the phrase for obtaining useful information in spite of a possible loss of
information.

Conveniently, each module is software-implemented and stored in
random-access memory of a suitable computer, e.g., a work-station computer. The
software can be in the form of executable object code, obtained, e.g., by compiling from
source code. Source code interpretation is not precluded. Source code can be in the form
of sequence-controlled instructions as in Fortran, Pascal or "C", for example.
Alternatively, a rule-based system can be used, such as Prolog, where suitable sequencing
is chosen by the system at run-time.

An illustrative portion of the GENIE system is shown in Appendix D
in the form of a Prolog source listing with comments. The following is further to the
comments.

Process_sents with get_inputsents, process_sects and outputresults reads
in an input stream, processes sections of the input stream according to parameter settings,
and produces output according to the settings, respectively. Among parameters supplied
to Process_sents are the following: Mode (specifying the parsing mode) and Protocol
(html or plain). Process_sents is called by another predicate, after user-specified
parameters have been processed.

The parsing modes are selected by GENIE so as to parse a sentence or
phrase structure using a grammar that includes one or more patterns of semantic and
syntactic categories that are well-formed. For example, for the phrase "bad inactivates
jnk" a legitimate pattern can be substance1 action substance2, wherein substance1 =
protein "bad", action = "inactivates" and substance2 = "jnk." However, if parsing fails,
various error recovery modes are utilized in order to achieve robustness. The error
recovery techniques use methods such as segmenting the sentence, processing large
chunks of the sentence, and processing local phrases. Each recovery technique is likely
to increase sensitivity but decrease specificity and precision. Sensitivity is the
performance measure equal to the true positive rate of the natural language processing
(NLP) system, i.e., the fraction of the information that should have been extracted that
was actually extracted. Specificity is the performance measure equal to the true
negative rate of the system, i.e., the fraction of the information that should not have
been extracted that was in fact not extracted. Precision is the reliability of the
system, i.e., the ratio of information extracted correctly to all the information
that was extracted. In processing a report, the most specific mode is attempted first, and
successively less specific modes are used only if needed.
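The three performance measures can be written out as a short sketch in terms of true/false positive and negative counts. The function names and the example counts are illustrative only and do not represent measured results for GENIE.

```python
def sensitivity(tp, fn):
    """True positive rate: extracted items among all items that should be extracted."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: correctly ignored items among all items that should be ignored."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Correctly extracted items among all extracted items."""
    return tp / (tp + fp)


# Illustrative counts only (not measured results).
tp, fn, fp, tn = 6, 2, 3, 9
print(sensitivity(tp, fn), specificity(tn, fp), precision(tp, fp))
```

A recovery mode that extracts more aggressively tends to raise tp (and thus sensitivity) while also raising fp, which lowers both specificity and precision.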

In accordance with the preferred embodiments of the present invention, 
the parser of Figure 2 includes five parsing modes, Modes 1 through 5, for parsing 
sentences or phrases. Nominally, the parser is configured to first select Mode 1. If Mode 
1 is not possible, the program continues with Mode 2 and so forth until parsing is 
complete. With Mode 1, the initial segment is the entire sentence and all words in the 
segment must be defined. This mode requires a well-formed pattern for the complete 
segment. 

Mode 2 requires that the sentence or phrase be segmented at certain types 
of words or phrases, e.g., "is attributable to." Here, an attempt is made to recognize each
segment independently, i.e., a first segment ending with the word "is" and a second
segment beginning with the word after "to." The segmenting process is repeated until an
analysis of each segment is obtained or until segmenting is no longer possible. 

Mode 3 requires a well-formed pattern for the "largest" prefix of the
segment, i.e., the longest portion, usually at the beginning of the segment, that fits the
grammar. This occurs when a sentence
contains a pattern at the end which is not in the grammar but a beginning portion that is
included. For example, in "bad inactivates jnk at this time", the beginning of the
sentence "bad inactivates jnk" will be parsed and the remainder will be skipped.

Mode 4 requires that undefined words be skipped and an analysis be
attempted in accordance with Mode 1. Mode 4 is useful where there are typographical
errors and unknown words. For example, in the phrase "abc bad inactivates jnk", the
word "abc" is unknown to the system and will be ignored but the remainder of the phrase
will be parsed.

Mode 5 first requires that the first word or phrase in the segment
associated with an action be found. Next, an attempt is made to recognize the phrase
starting with the leftmost recognizable argument. For example, in "during bad inactivates
jnk on the fifth day," the phrase "bad inactivates jnk" will be parsed and the remaining
words will not be. If no analysis is found, recognition is retried at the next possible
argument to the right. This process continues until an analysis is found.
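The fallback order of the parsing modes can be sketched on a toy grammar consisting of the single pattern substance1 action substance2. The sketch is a simplification under stated assumptions: the lexicon and the category "other" are hypothetical, Mode 2 (segmentation) is omitted for brevity, and the real grammar is the one in Appendix D.

```python
# Hypothetical toy lexicon; "other" marks defined words that are not part of the pattern.
LEXICON = {
    "bad": "substance", "jnk": "substance", "inactivates": "action",
    "during": "other", "on": "other", "the": "other", "fifth": "other",
    "day": "other", "at": "other", "this": "other", "time": "other",
}
PATTERN = ["substance", "action", "substance"]  # substance1 action substance2


def match_prefix(words):
    """Return a frame if the first three words fit PATTERN, else None."""
    if len(words) < 3 or [LEXICON.get(w) for w in words[:3]] != PATTERN:
        return None
    return ["action", words[1], words[0], words[2]]


def mode1(words):
    # The entire segment must be well formed.
    return match_prefix(words) if len(words) == 3 else None


def mode3(words):
    # A well-formed prefix is accepted and the remainder is skipped.
    return match_prefix(words)


def mode4(words):
    # Undefined words are skipped, then Mode 1 is applied.
    return mode1([w for w in words if w in LEXICON])


def mode5(words):
    # An action must be present; retry at each argument from left to right.
    if "action" not in [LEXICON.get(w) for w in words]:
        return None
    for i, word in enumerate(words):
        if LEXICON.get(word) == "substance":
            frame = match_prefix(words[i:])
            if frame:
                return frame
    return None


def parse(sentence):
    """Try the most specific mode first and fall back to less specific ones."""
    words = sentence.lower().split()
    for number, mode in ((1, mode1), (3, mode3), (4, mode4), (5, mode5)):
        frame = mode(words)
        if frame:
            return number, frame
    return None, None


for text in ("bad inactivates jnk",
             "bad inactivates jnk at this time",
             "abc bad inactivates jnk",
             "during bad inactivates jnk on the fifth day"):
    print(parse(text))
```

Each of the four example sentences from the text is resolved by a different mode (1, 3, 4 and 5 respectively), which is the intended effect of attempting the most specific mode first.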

Process_sects with get_section and parse_sentences gets each section and
generates intermediate output for the sentences in each section.

Write produces the output as a list consisting of relations and interactions.

Setargs sets arguments or parameter values based on user input or by
default.

The structured output generated by the GENIE program uses a frame-based
representation. Each frame specifies the informational type, the value, and
argument or modifier slots, which are also frames. Consider the text data input "bad
inactivates the phosphorylation of jnk." A corresponding output, as shown below, is a
frame denoting an action, which has the value inactivate; in addition, there are two
arguments. The first argument is a protein bad and the second argument is an action with
the value phosphorylate, which has two arguments. The first argument is x, signifying that
the agent has not been specified; the second argument is a protein with the value jnk. The
second argument is the target:
[action,inactivate,[protein,bad],[action,phosphorylate,x,[protein,jnk]]]
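The frame representation can be mimicked with nested lists, as in the following minimal sketch. The helper names frame and render are hypothetical; GENIE itself produces these forms from its Prolog grammar.

```python
def frame(kind, value, *args):
    """A frame is a list: informational type, value, then argument frames."""
    return [kind, value, *args]


def render(item):
    """Print a frame in the bracketed notation used in the text."""
    if isinstance(item, str):
        return item
    return "[" + ",".join(render(part) for part in item) + "]"


output = frame(
    "action", "inactivate",
    frame("protein", "bad"),                      # agent
    frame("action", "phosphorylate", "x",         # target: a nested action, agent unspecified
          frame("protein", "jnk")),
)
print(render(output))
```

Because arguments are themselves frames, an action can serve as the agent or target of another action to any depth.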

In summary, a computer system has been disclosed that generates 
structured information concerning protein and gene interactions and relationships. 



5.2. USE OF BLAST FOR FINDING GENE AND
PROTEIN NAMES IN JOURNAL ARTICLES

In a specific embodiment of the invention, an exhaustive list of gene and
protein names, extracted from GenBank, is translated into a different alphabet system by
substituting each character in the name with a predetermined unique nucleotide
combination. The encoded names are then imported into the BLAST database using the
FASTA format. The scientific journals are translated, using the same nucleotide
combinations, into a continuous string of nucleotides. A query is then used to match the
translated journals against the nucleotide representation of gene and protein names in the
BLAST database. Significant alignments associated with gene and protein names are
listed in the BLAST output file, which is subsequently processed using Perl scripts. The
final result consists of the original journal article with XML tags surrounding the gene
and protein names.

To adapt the problem to BLAST's statistical foundation, different
measures were undertaken to limit the output to the most relevant gene and protein
names. In addition, in order to fine-tune the matching process, different BLAST
parameters were adjusted, such as the word size (which sets the size of the high scoring
words, thus influencing the sensitivity of finding HSPs) and mismatch penalty (exact vs.
approximate matching).

In a specific embodiment of the invention, gene and protein names are
extracted from GenBank's gene symbol index file. The following is an excerpt of the
file after discarding entries that are either composed of only numbers or of fewer than two
alphabetic letters:

gfap gamma
hox a10
hox a1
wac 3'-end
pit-1/ghf-1 variant
[...]

This list of gene and protein names is translated into a different alphabet
system by substituting each character in the name with a predetermined unique nucleotide
combination. The conversion chart is listed in Appendix E. The encoded names are then
imported into the BLAST database using the FASTA format. For example, the first entry
in the list above is "gfap gamma." After translation using the conversion chart, the same
name appears as follows:
AAGCAACTAAACACCCATCCAAGCAAACACACACACAAAC

Thus, the complete FASTA entry looks like this:
>gi|1 species,gp,gfap gamma
AAGCAACTAAACACCCATCCAAGCAAACACACACACAAAC
In FASTA, the definition line (marked with '>') contains information about
the database entry. This line can contain any kind of information. The information
important for this particular example is the third entry in the definition line, 'gp', which
specifies that the name can represent a gene or a protein. If the name is unambiguous, then
the definition line states that the name is only associated with a gene ('g') or protein ('p').
The fourth entry in the definition line is the name of the protein or gene, "gfap gamma" in
this case.

The second line in the FASTA format normally contains the actual
sequence of the protein/gene. In the example presented, the second line contains the
translated protein or gene name.
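The translation of a name into a FASTA entry can be sketched as follows. The code table below is only a partial table inferred from the worked examples in this section (the "gfap gamma" entry, the translated sentence and the "ner" alignment); the complete chart is the one in Appendix E. The function names, the error handling and the treatment of the comma as a separator are assumptions.

```python
# Partial character-to-nucleotide table inferred from the examples in this section.
# Letters not occurring in those examples are deliberately left out.
CODE = {
    "a": "AAAC", "b": "AAAG", "c": "AAAT", "d": "AACC", "e": "AACG",
    "f": "AACT", "g": "AAGC", "h": "AAGG", "i": "AAGT", "l": "AATT",
    "m": "ACAC", "n": "ACAG", "o": "ACAT", "p": "ACCC", "r": "ACCT",
    "s": "ACGC", "t": "ACGG", "u": "ACGT", "v": "ACTC",
}
SPACE = "ATCC"          # stands for the space character and its substitutes
SEPARATORS = " ,"       # assumption: only the separators evidenced by the examples


def encode(text):
    """Translate text into nucleotides, four per character (case is ignored)."""
    out = []
    for ch in text.lower():
        if ch in SEPARATORS:
            out.append(SPACE)
        elif ch in CODE:
            out.append(CODE[ch])
        else:
            raise KeyError(f"no code for {ch!r} in the partial table")
    return "".join(out)


def fasta_entry(gi, sem_class, name):
    """Build a two-line FASTA record in the layout shown above."""
    return f">gi|{gi} species,{sem_class},{name}\n{encode(name)}"


print(fasta_entry(1, "gp", "gfap gamma"))
```

The same encode function applies to journal text, so that names and articles share one alphabet.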

All gene and protein names are translated into the nucleotide
representation and converted into the FASTA format. Then, the database containing these
FASTA entries is specially compiled for use in BLAST queries using a program that is
included in the BLAST package called "formatdb".

The scientific journals are likewise translated, using the same nucleotide
combinations, into a continuous string of nucleotides. For example, the sentence "In the
absence of costimulation, T cells activated through their antigen ..." is translated into

"AAGTACAGATCCACGGAAGGAACGATCCAAACAAAGACGCAACGACAGAA
ATAACGATCCACATAACTATCCAAATACATACGCACGGAAGTACACACGTAA
TTAAACACGGAAGTACATACAGATCCATCCACGGATCCAAATAACGAATTAA
TTACGCATCCAAACAAATACGGAAGTACTCAAACACGGAACGAACCATCCAC
GGAAGGACCTACATACGTAAGCAAGGATCCACGGAAGGAACGAAGTACCTA
TCCAAACACAGACGGAAGTAAGCAACGACAGATCC"

A query is then used to match the translated journals against the nucleotide
representation of gene and protein names in the BLAST database. The query is executed
using the blastall program that is included in the BLAST package. The query line looks
like:

blastall -p blastn -d FASTA.dat -i query.txt

The flag 'p' denotes the sub-program (blastn is a sub-program of blastall
that performs nucleotide matches), 'd' denotes the file that contains the FASTA entries
and 'i' denotes the translated query text.

Significant alignments associated with gene and protein names are listed
in the BLAST output file. This is an excerpt from a BLAST output file:

gi|63624 species,gp,ner
Length = 12
Score = 24.4 bits (12), Expect = 3e-05
Identities = 12/12 (100%)
Strand = Plus / Plus
Query: 729 acagaacgacct 740
Sbjct: 1 acagaacgacct 12

The first line denotes the database entry. The second line denotes the
database sequence length, followed by the alignment score and the E-value. The next line
indicates paired matches, mismatches and gapped alignment (the latter two are not shown
in this example). The lines 'Query' and 'Sbjct' show the actual alignment between the
query and database sequence. This output file is subsequently processed using a Perl
script (see Appendix F). The script shown in Appendix G scans the output file, which is
sometimes several megabytes long, for any segments that start at position 1 of the
database sequence (thus disregarding any segments that are only part of the sequence). In
addition, the script allows for 10% mismatches between the aligned sequences for long
sequences (as shown in the script of Appendix E), or 0% mismatches for short sequences.
After scanning the output file, an intermediary file that lists the candidate sequences is
created:

tran|365|381|gp|18493
tran|1|17|gp|18493
peci|549|565|gp|58106
il-2|621|637|gp|82396
il-2|325|341|gp|82396
gati|193|209|gp|92088
prod|641|657|gp|52292
rap1|105|121|gp|49898
spec|545|561|gp|33183
crip|385|401|gp|118905
crip|21|37|gp|118905
as|161|177|gp|133961
her|65|77|gp|88411

The intermediary file lists the name of the sequence, followed by the
starting and end points in the query sequence (corresponding to where the two sequences
matched) and the semantic class of the name (protein, gene or protein/gene). The last number
is not considered.
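The filtering rule and the intermediary file layout can be illustrated with the following sketch, which is not the Perl script of the appendices. Several points are assumptions: the cut-off between "short" and "long" sequences (LONG_NAME_NT) is hypothetical because the text does not state it, and the conversion from nucleotide coordinates to character offsets is inferred from the sample lines, where a four-letter name spans positions such as 365 to 381.

```python
from typing import NamedTuple

NT_PER_CHAR = 4      # each character is encoded as four nucleotides
LONG_NAME_NT = 40    # hypothetical cut-off between "short" and "long" sequences


def keep_hit(sbjct_start, identities, aligned, long_cutoff=LONG_NAME_NT):
    """Accept a hit that starts at subject position 1 and has few enough mismatches."""
    if sbjct_start != 1:
        return False
    allowed_fraction = 0.10 if aligned >= long_cutoff else 0.0
    return (aligned - identities) <= allowed_fraction * aligned


class Candidate(NamedTuple):
    name: str
    char_start: int   # 0-based offset into the original text
    char_end: int     # exclusive end offset
    sem: str          # 'g', 'p' or 'gp'


def parse_candidate(line):
    """Parse one line such as 'tran|365|381|gp|18493'; the last field is ignored."""
    name, start, end, sem, _ignored = line.strip().split("|")
    return Candidate(name,
                     (int(start) - 1) // NT_PER_CHAR,
                     (int(end) - 1) // NT_PER_CHAR,
                     sem)


print(keep_hit(sbjct_start=1, identities=12, aligned=12))
print(parse_candidate("tran|365|381|gp|18493"))
```

Under the inferred mapping the sample line for "tran" corresponds to four characters beginning at text offset 91, which is what a later step needs in order to inspect the surrounding characters.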

The intermediary file is then scanned by another Perl program (Appendix
G). This program compares the starting and end points with the actual text, making sure that
the matched name is an 'autonomous' entity in the query text. For example, while "per" in
"per gene" should be recognized as a gene name, "per" in "personal" should not be
recognized as a gene name. The program also recognizes characters other than the space
character as delimiting an 'autonomous' gene or protein name. In addition, the script looks
for plurals of words. For example, "interleukins" should be recognized as a protein name,
although only the singular form, "interleukin", is in the database.
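The test for an 'autonomous' name, including the plural case, can be sketched as a check of the characters on either side of a match. The delimiter set and the function name are assumptions; the actual rules are those of the Perl program in Appendix G.

```python
# Hypothetical delimiter set: whitespace plus common punctuation.
DELIMITERS = set(" \t\n.,;:()[]/-\"'")


def is_autonomous(text, start, end):
    """True if text[start:end] is bounded by delimiters, allowing a plural 's'."""
    if start > 0 and text[start - 1] not in DELIMITERS:
        return False
    following = text[end:end + 1]
    if following == "" or following in DELIMITERS:
        return True
    if following in ("s", "S"):
        # Plural form: the character after the 's' must end the word.
        after_plural = text[end + 1:end + 2]
        return after_plural == "" or after_plural in DELIMITERS
    return False


print(is_autonomous("the per gene", 4, 7))        # "per" followed by a space
print(is_autonomous("personal", 0, 3))            # "per" inside a longer word
print(is_autonomous("interleukins act", 0, 11))   # plural of "interleukin"
```

A simple plural rule of this kind would also accept a name that happens to be followed by an unrelated "s", so a production version would likely need further checks.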

The final result consists of the original journal article with XML tags 
surrounding the gene and protein names. This is done using the same script as in 
Appendix G: 

blocked <phr sem="gp">T cell antigen receptor</phr> (TCR)- and <phr 
sem="gp">CD28</phr>-mediated <phr sem="gp">IL-2</phr> gene transcription. 
Therefore, <phr sem="gp">Rap1</phr> functions as a negative regulator of ...
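The insertion of the XML tags can be sketched as follows, assuming that the accepted matches are available as non-overlapping character spans. The function name and the span layout are hypothetical, and no XML escaping of the article text is attempted in this sketch.

```python
def tag_names(text, spans):
    """Wrap each (start, end, sem) span of text in a <phr sem="..."> element."""
    pieces = []
    position = 0
    for start, end, sem in sorted(spans):
        pieces.append(text[position:start])
        pieces.append(f'<phr sem="{sem}">{text[start:end]}</phr>')
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


sentence = "CD28-mediated IL-2 gene transcription"
print(tag_names(sentence, [(14, 18, "gp"), (0, 4, "gp")]))
```

Sorting the spans first lets the matches be supplied in any order, for example in the order in which they appear in the intermediary file.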

To adapt the problem to BLAST's statistical foundation, different
measures were undertaken to limit the output to the most relevant gene and protein 
names. 

BLAST is sensitive to the search space the program works in. Thus, given
a long query sequence and a large sequence database, matches have a lower statistical
significance because the chances are higher that the matches could have occurred by
chance alone. In addition, matches with few letters have a lower statistical significance
than matches with many letters. In order to find all true matches with any significance
level, some measures were undertaken to address this problem. For example, (i) the
query sequence was divided into 10 equal-length parts, i.e., the journal article was divided
into 10 parts and a separate query was run on each part; (ii) the sequence
database (with the gene and protein names) was separated into 5 databases, each containing
protein/gene names of different length; (iii) gene and protein names with fewer than 3
letters in the database were 'expanded', i.e., spaces were added at the beginning and the
end of the name. Doing so, the statistical significance of a match containing a short name
was higher. A space does not only include an empty character. For example, a gene name
"k4" could occur in a journal article as "kinin 4 (k4)". It was therefore important to define
several characters as substitutes for a space character. The alphabet in Appendix E defines
the nucleotide combination ATCC as such a substitute.
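Measures (i) through (iii) can be illustrated with the following sketch. The function names are hypothetical, and the length boundaries used for the five databases are an assumption, since the text states only that the databases hold names of different length.

```python
NT_PER_CHAR = 4  # each character is encoded as four nucleotides


def split_query(encoded, parts=10):
    """Split an encoded article into `parts` chunks that end on codeword boundaries."""
    n_chars = len(encoded) // NT_PER_CHAR
    bounds = [(k * n_chars // parts) * NT_PER_CHAR for k in range(parts + 1)]
    return [encoded[bounds[k]:bounds[k + 1]] for k in range(parts)]


def expand_short(name, min_len=3):
    """Pad names shorter than min_len with a space on each side."""
    return f" {name} " if len(name) < min_len else name


def bucket_by_length(names, n_buckets=5):
    """Assumed bucketing: bucket i holds names of i+1 characters, the last holds the rest."""
    buckets = [[] for _ in range(n_buckets)]
    for name in names:
        buckets[min(len(name), n_buckets) - 1].append(name)
    return buckets


chunks = split_query("ACGT" * 25)
print([len(chunk) // NT_PER_CHAR for chunk in chunks])
print(repr(expand_short("k4")))
print(bucket_by_length(["as", "her", "tran", "interleukin"]))
```

A name that straddles a chunk boundary would be missed by a plain split of this kind unless neighbouring chunks overlap; the text does not say how that case was handled.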

Working with nucleotides implies that errors involving reading frames
must be addressed. For example, working with a code of four letters, the nucleotide
combination ATCTGTCACG could mean ATCT/GTCA or TCTG/TCAC or
CTGT/CACG. Since the text is translated into a nucleotide combination, only one of
these possibilities is correct. But BLAST cannot distinguish between these solutions, i.e.,
BLAST would potentially match a database sequence to a wrong reading frame in the
query sequence, producing many nonsense results that could compromise the significance
of true results.

The solution to this problem is a comma-free code. A comma-free code
has only one correct reading frame. BLAST therefore does not produce any nonsense
results. A comma-free code contains only one of the cyclic permutations of a nucleotide combination.
For example, given the nucleotide combination ATCC and its permutations CATC,
CCAT and TCCA, only ONE of these permutations would be included in a comma-free
code. The code in Appendix E does represent a comma-free code. Comma-free codes
were discussed in the early days of DNA research (Crick et al., Proc. Natl. Acad. Sci.
43:416-421).
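The comma-free property can be checked mechanically: for every pair of codewords placed side by side, no codeword may appear in a shifted reading frame. The following sketch assumes codewords of equal length and uses small toy codes rather than the chart of Appendix E; the function name is hypothetical.

```python
def is_comma_free(codewords):
    """True if no codeword occurs out of frame in any concatenation of two codewords."""
    code = set(codewords)
    size = len(next(iter(code)))
    if any(len(word) != size for word in code):
        raise ValueError("all codewords must have the same length")
    for first in code:
        for second in code:
            joined = first + second
            # Shifts 1 .. size-1 are the wrong reading frames.
            for shift in range(1, size):
                if joined[shift:shift + size] in code:
                    return False
    return True


print(is_comma_free({"ATCC"}))           # a single permutation of ATCC
print(is_comma_free({"ATCC", "CATC"}))   # two cyclic permutations of the same word
```

In the second call the concatenation ATCCATCC contains CATC at a shift of three, so a match to CATC could arise in the wrong reading frame.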

In order to fine-tune the matching process, different BLAST parameters
must be adjusted, for example: word size (which sets the size of the high scoring words,
thus influencing the sensitivity of finding HSPs); mismatch penalty (exact vs. approximate
matching); number of alignments to show (true matches of low significance can
sometimes be at the very end of the BLAST output, therefore many alignments have to be
shown); and expectation value (which sets the significance value for matches in the
output file).
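The parameters listed above correspond to command-line flags of the legacy NCBI blastall program: -W (word size), -q (penalty for a nucleotide mismatch), -e (expectation value), and -b and -v (number of alignments and of one-line descriptions to show). The sketch below only assembles such a command line. The numeric values are placeholders equal to common blastn defaults, not the settings used in the embodiment, and the output file name is hypothetical.

```python
def blastall_command(db, query, out, word_size=11, mismatch_penalty=-3,
                     expect=10.0, alignments=250):
    """Assemble a legacy blastall command line as an argument list."""
    return [
        "blastall", "-p", "blastn",
        "-d", db, "-i", query, "-o", out,
        "-W", str(word_size),          # word size
        "-q", str(mismatch_penalty),   # mismatch penalty
        "-e", str(expect),             # expectation value threshold
        "-b", str(alignments),         # number of alignments to show
        "-v", str(alignments),         # number of one-line descriptions to show
    ]


command = blastall_command("FASTA.dat", "query.txt", "query.out")
print(" ".join(command))
```

The list form can be passed to subprocess.run on a system where the legacy BLAST package is installed; the newer BLAST+ programs use different option names.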
5.3. GENERATION OF SPECIALIZED DATABASES
In accordance with the present invention, specialized databases may be 
developed that contain information derived from unpublished data, publications such as 
research articles, theses, posters, abstracts, etc. and/or databases concerning interactions 
among genes and proteins, their domain/motif structure, and their biological functions. 

For example, but not by way of limitation, a specialized database may be
prepared as follows. Protein and gene sequences may be provided, for example, by the
Java program PsiRetriever, which allows for quick retrieval of protein or nucleotide
sequences from binary BLAST databases by sequence accession number, keyword or
groups of keywords, or species name. In addition, using the program PsiRetriever,
sequences encoding the proteins of interest may be retrieved from the non-redundant
(NCBI) database of protein sequences and stored as a FASTA file. The FASTA file is
then converted into a binary BLAST database using the program FORMATDB from the
BLAST suite of programs.

Known motifs/domains for proteins may also be collected using the flat
file versions of major protein databases, such as SwissProt (http://expasy.hcage.ch/sprot)
and the non-redundant database of NCBI (http://www3.ncbi.nlm.nih.gov). The databases
can be downloaded and searched for the keywords "motif" and "domain" in the feature
tables of proteins. In addition, existing databases of motifs and domains, such as
BLOCKS (http://dupsas.Weizmann.acil/bcd/bcdparent//databanksblocks/hfrnl) and
pfam (http://www.sanger.ac.uk//software/pfam; http://pfm.wustl.edu), can be downloaded
(Henikoff et al., 1991, NAR 19:6565-6572). Still further, it is understood that any
publicly available database containing gene/protein sequences may be utilized to
generate the specialized databases for use in the practice of the present invention.

Homologous sequences may be aligned using, for example, the
CLUSTALW program (Higgins et al., 1996, Methods in Enzymology 266:383-402). A
protein's sequence corresponding to each domain/motif can be identified, saved and used
for building a Hidden Markov Model (HMM) of the domain/motif using the HMMER and
HMMER2 packages (see Durbin, R. et al., 1998, in Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids). The HMMER and HMMER2 packages
are useful for (i) building HMMs from sets of aligned protein or nucleotide sequences,
and (ii) comparing the HMMs with sequence databases aimed at identifying significant
similarities of HMMs with database sequences. Both nucleotide and protein databases
can be used for this purpose. Alternatives to the Hidden Markov Model method for
building domain/motif models include neural network motif analysis (Wu, C.H. et al.,
1996, Comput Appl Biosci 12:109-18; Hirst, J.D., 1991, Protein Eng 4:615-23) and
positional weight matrix analysis (Claverie, J.M., 1994, Comput Chem 18:287-94;
Venezia, D., 1993, Comput Appl Biosci 9:65-9; Bucher, P., 1996, Comput Chem 20:3-23;
Tatusov, R.L., 1994, Proc Natl Acad Sci USA 91:12091-5).

Once a comprehensive collection of motifs/domains is created, each
particular protein may be compared against a complete database of HMMs to identify 
known motifs and domains. 

The Hidden Markov Model (HMM) is built using the following steps:

A1. Start with a motif/domain name and a single amino acid sequence
    representing a domain or motif.
A2. Do PSI-BLAST (BLASTPGP) search with the motif/domain sequence
    against a protein non-redundant database.
A3. Retrieve the sequences identified in the database search from the protein
    sequence database. Exclude low-complexity sequences, short or
    incomplete sequences and sequences with similarity score above a selected
    threshold of PPD value <0.001.
A4. Align the set of sequences with CLUSTALW (or other multiple sequence
    alignment program).
A5. Use the set of aligned sequences for building HMM with the programs
    provided with HMMER and HMMER2 packages (see Hughey and Krogh
    1996, J. Mol. Biol. 235:1501-1531).
A6. Do a new database search comparing new HMM with the non-redundant
    protein database.
A7. Continue steps A3-A6 until the convergence of the Markov model, i.e.,
    until no new sequences are identified, or the maximum allowed number of
    iterations as defined by the user is reached (Hughey R. and Krogh A., 1996,
    Comput. Appl. Biosci. 12:95-107).
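The control flow of steps A1 through A7 can be sketched as a loop that stops when a search with the current model finds no new sequences or when the iteration limit is reached. The search, filtering, alignment and model-building steps are passed in as functions because in practice they are performed by external programs (PSI-BLAST, CLUSTALW, HMMER); all names in the sketch are hypothetical, and the toy stand-ins at the end exist only to exercise the loop.

```python
def build_hmm_iteratively(seed, search_seed, search_hmm, keep, align, build,
                          max_iter=10):
    """Iterate search -> filter -> align -> build until no new sequences appear."""
    hits = {s for s in search_seed(seed) if keep(s)}             # steps A2-A3
    hmm = None
    for _ in range(max_iter):
        hmm = build(align(sorted(hits)))                         # steps A4-A5
        found = {s for s in search_hmm(hmm) if keep(s)} | hits   # steps A6, A3
        if found == hits:                                        # step A7: convergence
            break
        hits = found
    return hmm, hits


# Toy stand-ins: each "sequence" is a label and each search reaches its neighbours.
NEIGHBOURS = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}

hmm, members = build_hmm_iteratively(
    seed="a",
    search_seed=lambda seed: {seed} | NEIGHBOURS[seed],
    search_hmm=lambda model: set().union(*(NEIGHBOURS[s] for s in model)),
    keep=lambda s: True,
    align=lambda seqs: list(seqs),
    build=lambda alignment: frozenset(alignment),
)
print(sorted(members))
```

With the toy data the set of member sequences grows by one per iteration and the loop stops once a search returns nothing new.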

In addition, in yet another embodiment of the invention, a specialized 
database may be designed to contain a semantic model of proteins and of the possible 
interactions between them. Such databases are particularly useful for computation and 
analysis of regulatory networks between proteins. The semantic model is designed for 
representing substances, such as proteins and actions between them, and is based on 
widely accepted principles of object-oriented programming languages such as Java. 
Figure 3 is a diagram illustrating the object representation of molecules and relations 
between them. As indicated in Figure 3, there are six major classes, corresponding to the
top-level classification of objects and actions: (i) a substance; (ii) a state of a substance;
(iii) a similarity between substances; (iv) an action between substances; (v) a result of the
action; and (vi) a mechanism that enables an action.

Figure 3 presents the class design graphically, listing the variables that
represent the properties of each class or class object in the implementation. Classes can
be made nested via the mechanism of "inheritance", i.e., classes are defined starting with
the most general ones and moving towards more specific classes. Definition of more
specific classes is simplified because the properties of the general classes are "inherited"
by the specific classes and need not be redefined each time (see Flanagan, 1997, Java in a
Nutshell, Second Edition, O'Reilly & Associates, Inc., Sebastopol, CA).

As shown in Figure 3, the two key object types in this scheme are 
substances (nodes of the graph representing regulatory networks) and actions (oriented 
edges connecting pairs of nodes), while result and mechanism objects are auxiliary to 
object action. Each substance object is characterized with a state. In this scheme, action 
is the most complicated object; each action object is characterized by a specific pair of 
substances participating in the action, one of which can be active and is referred to as 
Subject Substance and the second of which can serve as a substrate for the former and is 
referred to as Object Substance. Furthermore, for each action the initial and final states 
corresponding to interacting substances are defined. The property Time Required of each 
Action Object allows the setting of different durations for different actions (time is 
measured in relative units; see Rene Thomas and Richard D'Ari, 1990, "Biological 
Feedback," CRC Press Boca Raton, Ann Arbor, Boston). 
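The six top-level classes can be sketched as follows. The sketch is written in Python rather than Java, and the field names are assumptions except for the subject substance, the object substance and the time required, which are named in the text; the actual variables are those listed in Figure 3.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    description: str


@dataclass
class Substance:
    name: str
    state: State


@dataclass
class Protein(Substance):
    """A more specific class; name and state are inherited from Substance."""


@dataclass
class Similarity:
    first: Substance
    second: Substance


@dataclass
class Result:
    description: str


@dataclass
class Mechanism:
    description: str


@dataclass
class Action:
    name: str
    subject_substance: Substance           # the active participant
    object_substance: Substance            # the substrate
    initial_states: Tuple[State, State]    # (subject, object) before the action
    final_states: Tuple[State, State]      # (subject, object) after the action
    time_required: float = 1.0             # relative time units
    result: Optional[Result] = None
    mechanism: Optional[Mechanism] = None


active, inactive = State("active"), State("inactive")
bad = Protein("bad", active)
jnk = Protein("jnk", active)
edge = Action("inactivate", bad, jnk, (active, active), (active, inactive))
print(edge.subject_substance.name, "->", edge.object_substance.name)
```

In graph terms each Substance is a node and each Action is an oriented edge from its subject substance to its object substance.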






Once developed, the specialized databases can be used to identify novel 
genes based on computation and analysis of phylogenetic trees for multigene families and 
analysis of homologous regulatory networks. 

In a specific embodiment of the invention, a specialized database was
generated using a set of keywords defining proteins involved in apoptosis (see Figure 4).
The specialized sequence database was referred to as Apoptosis 3. As a first step in
generating the specialized database, a comprehensive set of articles describing the system
of apoptosis or programmed cell death was compiled. The articles were analyzed and
information on regulatory pathways characterizing apoptosis from a variety of different
organisms was extracted. Such pathways included those involved in MHC-T cell
receptor interactions, inflammatory cytokine signal transduction, induction by light,
γ-radiation, hyperosmolarity or heat shock, pathways involving immunoregulatory
receptors or receptors having cytoplasmic domains, integrin-related pathways and
perforin/granzyme-related pathways. The collected information was stored using
PowerPoint (Microsoft) as a collection of graphs/plots depicting the regulatory pathways.
In addition, a list of proteins relevant to regulation of apoptosis was compiled.

Using the program PsiRetriever, sequences encoding the proteins relevant
to regulation of apoptosis were retrieved from the non-redundant (NCBI) database of
protein sequences and stored as a FASTA file. The FASTA file was then converted to a
binary BLAST database using the program FORMATDB from the BLAST suite of programs.
The BLAST suite provides a set of programs for very fast comparisons of a
single sequence to a large database. Both the database and the search or query sequence
can be any combination of nucleotide and/or amino acid sequences.

In a working example described herein, the Apoptosis 3 database was
used to compare genomic and cDNA sequences derived from the 13q region of human
chromosome 13. This region of the chromosome is associated with Chronic
Lymphocytic Leukemia (CLL). Using this method of analysis a human gene with
significant homology to the mouse Rpt1 gene was identified. When the activity of Rpt1
is knocked out in mice, the regulatory effect is manifested as a block in T-lymphocyte
apoptosis. This result indicates that the identified human Rpt1 homolog may represent
the gene in which genetic defects lead to CLL.

The amino acid sequence of the human Rpt1 gene is presented in Figure
15. The present invention relates to nucleic acid molecules encoding the human Rpt1
protein shown in Figure 15. The invention also relates to nucleic acid molecules capable
of hybridizing to a nucleic acid molecule encoding the human Rpt1 protein presented in
Figure 15 under conditions of high stringency. By way of example and not limitation,
procedures using such conditions of high stringency are as follows: Prehybridization of
filters containing DNA is carried out for 8 hours to overnight at 65°C in buffer composed
of 6x SSC, 50 mM Tris-HCl (pH 7.5), 1 mM EDTA, 0.02% PVP, 0.02% Ficoll, 0.02%
BSA and 500 µg/ml denatured salmon sperm DNA. Filters are hybridized for 48 h at
65°C in prehybridization mixture containing 100 µg/ml denatured salmon sperm DNA
and 5-20 x 10^6 cpm of 32P-labeled probe. Washing of filters is done at 37°C for 1 h in a
solution containing 2x SSC, 0.01% PVP, 0.01% Ficoll and 0.01% BSA. This is followed
by a wash in 0.1x SSC at 50°C for 45 minutes before autoradiography. Other conditions
of high stringency which may be used are well known in the art.

5.4. GENE DISCOVERY THROUGH PHYLOGENETIC
ANALYSIS OF GENE FAMILIES

The present invention provides a method for identifying novel genes 
comprising the following steps: (i) comparing a single sequence with a database; (ii) 
processing the output into a sequence alignment; (iii) computing gene trees; and (iv) 
analyzing the trees to predict the existence of undiscovered genes. 

Figure 5 shows a "species tree/' a "gene tree" and a "reconciled tree". A 
"species tree", as defined herein, is a graph depicting the correct order of speciation 
events leading to a set of present day species as defined by taxonomy. A "gene tree" is a 
graphical representation of the evolution of a gene from a single ancestral sequence in a 
common progenitor to a set of present-day sequences in different species. Where gene 
duplication has occurred, a branch is bifurcated. The branch lengths of a gene tree are 
most frequently measured either in terms of the number of amino acid or nucleotide 
replacements per site or in terms of millions of years (absolute geological time). In the 
former case, the average replacement rate in the majority of the published trees varies 
among tree branches, and the root-to-tip distances are different for different present day 
sequences. In the latter case, all root-to-tip distances are equal and the height of each 
interior node of the tree corresponds to the absolute geological time passed since the gene 
duplication corresponding to the interior node took place. 
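
The difference between the two branch-length conventions can be checked mechanically. The following is a minimal sketch and is not part of the disclosed program listing; the nested-list tree encoding, the function names and all branch lengths are hypothetical and chosen for illustration only. It sums branch lengths from the root to each present-day sequence and tests whether all root-to-tip distances are equal, as is expected when lengths are expressed in absolute geological time.

```python
import math


def root_to_tip(node, depth=0.0):
    """Return {leaf name: summed branch length from the root}.

    A node is either a leaf name (str) or a list of (child, branch_length)
    pairs. This encoding is an assumption made for the sketch.
    """
    if isinstance(node, str):
        return {node: depth}
    distances = {}
    for child, length in node:
        distances.update(root_to_tip(child, depth + length))
    return distances


def is_clocklike(tree, tol=1e-9):
    """True when every root-to-tip distance is the same (within tol)."""
    d = list(root_to_tip(tree).values())
    return all(math.isclose(x, d[0], abs_tol=tol) for x in d)


# Illustrative numbers only, not real divergence estimates.
# Branch lengths in millions of years: all tips are equally far from the root.
time_tree = [([("human", 90.0), ("mouse", 90.0)], 210.0), ("chicken", 300.0)]
# Branch lengths in replacements per site: rates differ among branches.
subst_tree = [([("human", 0.05), ("mouse", 0.12)], 0.10), ("chicken", 0.31)]

print(is_clocklike(time_tree), is_clocklike(subst_tree))
```

A tree scaled in time passes the check, while a tree scaled in replacements per site generally does not, because replacement rates vary among branches.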

If a gene is unique, i.e., represented with a single copy per genome rather 
than being a member of a family of similar genes, the correct gene tree depicting the 
origin of this gene in a few different species is identical to the species tree. In many 
instances, a single ancestral gene has been duplicated repeatedly during evolution to form 
a multigene family. A gene tree is constructed from a gene as it occurs in several species 
and reflects both speciation events and gene duplications within the same genome. Two 
homologous genes taken from different species that originated from the nearest common 
ancestor by speciation are referred to as orthologs, while any two genes that originated 
from the common ancestor via a series of events involving intragenomic duplications, or 
conversions, are called paralogs. The terms "ortholog" and "paralog" are applied to both 
nucleic acid and proteins herein. 

If gene deletions are forbidden and all genes for all species represented in 
the tree are known, the gene tree can be reconfigured to recapitulate the species tree, such 
that each subtree contains only orthologous genes. This tree is referred to as a reconciled 
tree and is shown in Figure 5. Imperfect gene trees which contain incorrect or partial 
species subtrees can be used to build reconciled trees that indicate events of speciation, 
gene loss, and gene duplication. 

Orthologs from different species in gene trees are usually clustered 
together, so that if all the existing homologous genes from different species were known, 
the same relationship of species would be recapitulated in each cluster of orthologous 
genes. Since in reality a considerable number of genes are not yet identified, the real gene 
trees contain incomplete clusters of orthologs that can be used for identification of the 
missing genes. 

By applying phylogenetic analysis, i.e., reconstruction of gene trees of 
gene/protein sequences, one can predict the existence of undiscovered genes in humans 
and other species in addition to identifying the function of a gene. Such a technique is a 
significantly more powerful tool for identification of new genes than mere sequence 
comparisons. 

Methods of computing gene trees from a set of aligned sequences include: 
(i) the heuristic method based on an optimization principle which is not directly 
motivated by a probability model (Fitch, 1974, J. Mol. Evol. 3:263-268); (ii) the 
maximum likelihood method (Goldman, 1990, Syst. Zool. 30:345-361; Yang et al., 1995, 
Syst. Biol. 44:384-399; Felsenstein, J., 1996, Methods Enzymol. 266:418-427); and (iii) 
the distance matrix tree making method (Saitou, N. and Nei, M., 1987, Mol. Biol. Evol. 
4:406-425). Since the data analyses of orthologs and paralogs often involve very 
distantly related sequences, the maximum likelihood method is preferably used for small 
data sets and the distance-matrix method in other instances. 

To construct a reconciled tree according to the invention, the first step 
comprises a search for homologs in a publicly or privately available database such as, for 
example, GenBank, Incyte, binary BLAST databases, Swiss Prot and NCBI databases. 
Following the identification of homologous sequences, a global alignment is performed 
using, for example, the CLUSTALW program. From the sequence alignment a gene tree 
is constructed using, for example, the computer program CLUSTALW, which utilizes the 
neighbor-joining method of Saitou and Nei (1987, Mol. Biol. Evol. 4:406-425). 
A species tree is then retrieved from, for example, the following web site: 
http://ww3.NCBimM.Nm.G0V//taxomy.tax.html. 

The species tree and gene tree are given as input into the algorithm 
described below, which integrates both trees into a reconciled tree. Agreement between 
the gene tree and the corresponding species tree for any given set of sequences indicates 
the identification of orthologs. In contrast, disagreement between the species and gene 
trees suggests a gene duplication that resulted in the formation of a paralog. Thus, through 
generation of a reconciled tree one can identify orthologs present in one species but 
missing in another. These can be deduced by forming subtrees of orthologs in a gene 
tree, and then comparing the subtree in the gene tree with a species tree. A missing gene 
appears as a branch present in the species tree but absent in the gene tree. 

The algorithm for defining an orthologous gene subtree and predicting the undiscovered, 
or lost in evolution, genes is as follows: 

Let T_g be the most likely gene tree identified with one of the consistent tree-making 
methods from a set of properly aligned homologous genes from species {1, 2, ..., s}, such 
that one or more homologous genes from every species correspond to pending vertices of 
T_g. Each gene is labeled with the species it comes from (1, ..., s), adding subscripts to 
distinguish homologous genes from the same species whenever necessary. Let T_s be 
the true species tree (the tree correctly reflecting speciation events, which we assume to be 
known) for species {1, 2, ..., s}. Due to the biological meaning of T_s, each species in this 
tree is represented only once. It is assumed that both T_s and T_g are binary, although it is 
straightforward to extend the algorithm described here to the case of multifurcated trees. 

Algorithm 

A1. For each pair of interior nodes from trees T_g and T_s, compute similarity σ(S_gi, S_sj). 
A2. Find the maximum σ(S_gi, S_sj). 
A3. Save S_gi as a new subtree of orthologs; save {S_gi} - {S_sj} as a set of species that 
are likely to have a gene of this kind (or lost it in evolution). 
A4. Eliminate S_gi from T_g: T_g := T_g \ S_gi. 
A5. Repeat A2 - A4 until T_g is empty. 

The following definitions apply: 

Let S_gi be the ith subtree of T_g (corresponding to the ith interior node); correspondingly, let 
S_sj be the jth subtree of tree T_s. 

Let {S_gi} stand for an unordered set of species represented in S_gi such that each species is 
represented exactly once, and let |{S_gi}| and |S_gi| be the number of entries in {S_gi} 
and the number of pending vertices in S_gi, respectively. Define by S_sj(S_gi) the unique 
subtree of S_sj that has leaves labeled exclusively with species from {S_gi}, so that each 
element of {S_gi} is used, that is, the unique subtree obtained by eliminating from 
S_sj all species that are not present in {S_gi}. 

Then define the similarity measure, σ, between S_gi and S_sj in the following way: 

σ(S_gi, S_sj) = 0 if |S_gi| ≠ |{S_gi}|, or S_sj(S_gi) ≠ S_gi, and 




[...] 

The support of tree clusters by data can be measured using the bootstrap technique 
described in Felsenstein (1985, Evolution 39:783-791). 
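
Steps A1 to A5 can be illustrated with a short sketch. It is not the program listing of the Appendix, and several points are assumptions made only for illustration: trees are encoded as nested tuples, gene names carry the species as a prefix (for example "mouse_2"), and, because the non-zero case of the similarity measure is not legible above, the sketch scores a consistent pair by the number of species in the gene subtree and breaks ties in favor of the smallest species subtree. For each ortholog subtree it reports the species of the matched species subtree that have no gene in that subtree.

```python
from itertools import product


def leaves(t):
    """Leaf labels of a nested-tuple tree, left to right."""
    return [t] if isinstance(t, str) else [x for c in t for x in leaves(c)]


def interior(t):
    """All interior nodes (subtrees with two or more leaves), root included."""
    return [] if isinstance(t, str) else [t] + [s for c in t for s in interior(c)]


def species_of(gene):
    """'human_2' -> 'human'; the suffix only distinguishes genes of one species."""
    return gene.split("_")[0]


def shape(t, label=lambda x: x):
    """Order-independent topology of a tree whose leaf labels are unique."""
    return label(t) if isinstance(t, str) else frozenset(shape(c, label) for c in t)


def prune(t, keep_leaf):
    """Keep only leaves for which keep_leaf is true, collapsing unary nodes."""
    if isinstance(t, str):
        return t if keep_leaf(t) else None
    kids = [k for k in (prune(c, keep_leaf) for c in t) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def sigma(sg, ss):
    """Similarity of gene subtree sg and species subtree ss (0 = inconsistent)."""
    genes = leaves(sg)
    sp = {species_of(g) for g in genes}
    if len(genes) != len(sp):            # some species occurs more than once
        return 0
    if not sp <= set(leaves(ss)):        # a species is absent from ss
        return 0
    if shape(prune(ss, lambda x: x in sp)) != shape(sg, species_of):
        return 0                         # topologies disagree
    return len(sp)                       # ASSUMED non-zero score


def ortholog_subtrees(gene_tree, species_tree):
    """Greedy loop A2-A4; returns [(genes in subtree, species lacking a gene)]."""
    found, tg = [], gene_tree
    while tg is not None and not isinstance(tg, str):
        pairs = [(sigma(g, s), -len(leaves(s)), g, s)
                 for g, s in product(interior(tg), interior(species_tree))]
        score, _, g, s = max(pairs, key=lambda p: p[:2])
        if score == 0:                   # nothing consistent is left
            break
        present = {species_of(x) for x in leaves(g)}
        found.append((sorted(leaves(g)), sorted(set(leaves(s)) - present)))
        removed = set(leaves(g))         # gene names are unique
        tg = prune(tg, lambda x: x not in removed)
    return found


# Hypothetical data: an ancient duplication gave two paralogous clusters,
# and no human gene is known in the second cluster.
species_tree = (("human", "mouse"), ("fly", "worm"))
gene_tree = ((("human_1", "mouse_1"), ("fly_1", "worm_1")),
             ("mouse_2", ("fly_2", "worm_2")))
result = ortholog_subtrees(gene_tree, species_tree)
print(result)
```

In this hypothetical example the second ortholog subtree is reported together with "human" as the species in which a gene of this kind remains to be found or was lost in evolution.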

In an embodiment of the invention, the human antiquitin gene was 
identified using phylogenetic analysis. The aldehyde dehydrogenase gene family in 
humans can be subdivided into at least ten ancient subtrees characterized by different 
functions of corresponding proteins. These genes probably arose from a series of gene 
duplications of an ancestral gene which took place before the divergence of a common 
ancestor of Eukaryotes and Eubacteria. 

The aldehyde dehydrogenase gene cluster is highlighted in Figure 6, which 
shows the original tree of ALDH sequences, the circled area indicating a sequence cluster 
where bacterial (Bacillus subtilis), plant (Brassica napus), and nematode 
(Caenorhabditis elegans) orthologs are present, but a human ortholog is not known. A 
random screening of cDNA libraries showed that a human ortholog, referred to as 
antiquitin, does exist. Figure 7 shows the same gene tree as in Figure 6 with an additional 
human protein referred to as antiquitin present in the tree. 

In yet another embodiment of the invention, a human ortholog of the 
murine Max-interacting transcriptional repressor Mad3 was identified through 
phylogenetic analysis of a gene family. The gene tree was constructed as follows. The 
protein sequences of known members of the Mad gene family were extracted from the 
GenBank database. The extracted sequences were aligned using the multiple alignment 
program CLUSTALW running on a Sun SPARC station. Redundant and non-homologous 
sequences, as well as distant homologs from S. cerevisiae, C. elegans, D. melanogaster, 
etc., were removed from the alignment. The refined set of sequences was realigned with 
CLUSTALW and a gene tree as presented in Figure 18A was computed. To identify a 
human ortholog of the Mad3 protein, the human dbEST at NCBI was searched with the 
program TBLASTN using mouse Mad3 protein sequences as a query. Two highly 
homologous ESTs were identified and are presented in Figure 17A. To obtain a complete 
coding sequence, a search was conducted to obtain overlapping sequences in dbEST. The 
search for overlapping sequences was performed using the program Iterate with EST 
zs77e55.rl (gb/AA278224) serving as a query. The search returned a single overlapping 
sequence, namely HUMGS0012279 (dbj/C02407), thus showing that the two EST 
sequences found during the initial TBLASTN search belong to the same gene. The 
complete sequence of the gene was assembled from the two ESTs using the commercially 
available sequence assembly program SeqManII (DNASTAR Inc., WI). The nucleotide 
sequence of the human Mad3 gene is presented in Figure 17B, and the deduced amino 
acid sequence is presented in Figure 17C. The complete DNA sequence is also shown. 

The present invention relates to nucleic acid molecules encoding the 
human Mad3 protein shown in Figure 17C. The invention also relates to nucleic acid 
molecules that hybridize to the nucleic acid molecule of Figure 17B under conditions of 
high stringency and encode a Mad3 protein. By way of example and not limitation, 
procedures using such conditions of high stringency are as follows: Prehybridization of 
filters containing DNA is carried out for 8 hours to overnight at 65 °C in buffer composed 
of 6x SSC, 50 mM Tris-HCl (pH 7.5), 1 mM EDTA, 0.02% PVP, 0.02% Ficoll, 0.02% 
BSA and 500 mg/ml denatured salmon sperm DNA. Filters are hybridized for 48 hours 
at 65 °C in prehybridization mixture containing 100 mg/ml denatured salmon sperm DNA 
and 5-20 x 10^6 cpm of 32P-labeled probe. Washing of filters is done at 37 °C for 1 hour in 
a solution containing 2x SSC, 0.01% PVP, 0.01% Ficoll and 0.01% BSA. This is 
followed by a wash in 0.1x SSC at 50 °C for 45 minutes before autoradiography. Other 
conditions of high stringency which may be used are well known in the art. 

5.5. SIMULATION AND HYPOTHESIS TESTING 
The simulation and hypothesis testing methods of the invention, described 
in the subsections below, utilize specialized databases of gene/protein structures and 
interactions for identifying potentially undiscovered members of multigene families 
through comparisons of regulatory networks for different species, searching expressed 
sequence tag (EST) databases, and simulation of regulatory cascades. 

5.5.1. GENE DISCOVERY THROUGH ANALYSIS 
OF REGULATORY NETWORKS 

The present invention provides a method for identifying undiscovered 
genes through comparisons of regulatory networks for different species where 
functionally similar regulatory systems are conserved. The amount of information 
available concerning regulatory genes and/or proteins in different organisms and their 
functional relationships allows one to reconstruct and compare regulatory networks. 
Since in most cases, the knowledge of all genes involved in almost any particular 
regulatory system is incomplete, a comparison of homologous networks within the same 
organism and between different species permits the identification of genes absent in a 
system under comparison. 

The identified genes, being part of a regulatory network, are implicated as 
potentially contributing to a phenotype of a disease associated with the system under 
analysis. Using the methods of the present invention these putative disease genes can be 
cloned, mapped and analyzed for mutations directly, thereby omitting the expensive and 
time-consuming steps of positional cloning and sequencing of genomic regions. 

Gene discovery by analysis of regulatory networks is outlined in Figure 8. The analysis 
is initiated starting with a biological system (e.g., signaling pathway of genes involved in 
Bcl-2-regulated apoptosis in lymphocytes), a single gene (e.g., Bcl-2) or a gene family 
(e.g., caspases). 

Initially, a specialized database is generated for comparison of regulatory 
networks between different species. For example, starting with a single candidate gene in 
a single species, a typical iteration in this process begins with identification of all known 
proteins and genes that are upstream and downstream with respect to it in regulatory 
hierarchies and the reconstruction of a network of interacting genes and proteins. Next, 
for each protein, a set of key domains and motifs is identified and this information is used 
to search for related proteins in humans and other species. The identified sequences are 
compared and for each pair of sequences showing similarity above a certain threshold, a 
similarity object is generated. A similarity object is generated if two sequences, 
nucleotide or amino acid, show significant similarity in database searches (p value 
< 0.001). The object retains the following information: (i) reference to the similar 
substances, i.e., genes or proteins; (ii) significance of the similarity, similarity score and 
percent identity; and (iii) coordinates of the similarity region within the two compared 
sequences. "Orthology objects" constitute a subset of "similarity objects" which satisfies 
one additional requirement, i.e., that the two similar sequences should be identified as 
orthologs by the tree-based algorithm described above. In identifying orthologs, if gene A 
is orthologous to gene B, and gene B is orthologous to gene C, gene A is necessarily 
orthologous to gene C. 
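
The similarity and orthology objects described above can be pictured as a small record type. The following is a minimal sketch under stated assumptions: the class name, the field names, the helper functions and the example sequence names are all hypothetical, and the orthology flag is assumed to be supplied by the tree-based algorithm rather than computed here. The grouping function applies the transitivity rule (A orthologous to B, and B to C, implies A to C) with a union-find structure.

```python
from dataclasses import dataclass

P_THRESHOLD = 0.001  # significance cut-off stated in the text


@dataclass(frozen=True)
class SimilarityObject:
    seq_a: str                 # (i) references to the two similar substances
    seq_b: str
    p_value: float             # (ii) significance, score and percent identity
    score: float
    percent_identity: float
    region_a: tuple            # (iii) (start, end) of the similar region in each
    region_b: tuple
    is_orthology: bool = False  # True once the tree-based algorithm calls orthologs


def make_similarity(seq_a, seq_b, p_value, score, percent_identity,
                    region_a, region_b, is_orthology=False):
    """Create an object only when the similarity is significant."""
    if p_value >= P_THRESHOLD:
        return None
    return SimilarityObject(seq_a, seq_b, p_value, score, percent_identity,
                            region_a, region_b, is_orthology)


def orthology_groups(objects):
    """Merge pairwise orthology calls into groups, using transitivity."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for o in objects:
        if o is not None and o.is_orthology:
            parent[find(o.seq_a)] = find(o.seq_b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pairwise results.
objs = [
    make_similarity("A", "B", 1e-30, 250.0, 62.0, (1, 200), (5, 204), True),
    make_similarity("B", "C", 1e-12, 120.0, 41.0, (1, 180), (9, 190), True),
    make_similarity("D", "E", 1e-05, 60.0, 30.0, (1, 90), (1, 95)),   # similar only
    make_similarity("A", "F", 0.2, 20.0, 18.0, (1, 50), (1, 50)),     # not significant
]
print(orthology_groups(objs))
```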

In a specific embodiment of the invention, for each species under analysis, 
orthologous proteins or genes are identified. In a further embodiment of the invention, 
small orthologous molecules participating in a regulatory network for two or more 
species may also be identified. Where proteins, genes, or molecules are orthologs, the 
action of the protein, gene or molecule between species may be interchangeable. If more 
than two species are involved in the analysis, subtrees of orthologous substances and 
subtrees of orthologous actions are identified. 

Once orthologous genes, proteins or molecules are identified in two or 
more species, by forming a reconciled tree, for example, a set of orthologous or 
paralogous regulatory networks can be analyzed and visualized using graph theory where 
arcs represent actions and vertices represent substances. Thus, the method of the 
invention may further comprise the following steps: (i) superimposing the orthologous 
regulatory networks from two or more species and searching for the actions (arcs) and 
substances (vertices) in the homologous networks that are represented in some taxa but 
absent in others; (ii) superimposing paralogous regulatory networks from the same taxa 
and searching for paralogous genes that are missing in some taxa; and (iii) computing a 
general regulatory network that summarizes common regulatory sequence relationships 
known for more than one species. 

In a specific embodiment of the invention, a set of regulatory networks 
from different species relating to the same biological system, apoptosis, for example, can 
be analyzed and visualized utilizing the following methods: (i) for each species, functional 
information relating to apoptosis is collected; (ii) using the functional information, 
regulatory networks comprised of the interacting proteins and/or genes involved in 
apoptosis are generated for each species; (iii) the sequences of the interacting proteins and 
genes of each regulatory network are compared, and similarity objects are generated for 
sequences showing similarity above a predetermined threshold; and (iv) orthologs and 
paralogs are distinguished using the methods set forth above. 

An analysis similar to that performed using subtrees of sequences may be 
applied to classify protein functions as orthologous or paralogous actions. A 
"generalized" regulatory network may be represented as a network wherein a substance as 
it occurs in a particular species is substituted with a cluster (i.e., subtree) of orthologous 
substances among species. In the final step of the analysis the clusters within each 
species are compared to one another, to identify missing genes. 

Figure 11 depicts the regulatory relationships among hypothetical proteins 
(denoted with Arabic numerals) of hypothetical species A and B. As indicated in Figure 
11A, the regulatory data for the two species overlap, but not completely. As 
indicated, protein 5 is known only for species B while protein 3 is known only for species 
A. The proteins in different species denoted with the same numeral are considered 
orthologous. As indicated, the regulatory relationships between a pair of proteins can be 
of three different kinds. Figures 9B, 9C, and 9D represent the Boolean operations OR, AND, 
and XOR applied to the arcs of the two sets of regulatory relationships depicted in Figure 9A, 
the same operations being applicable to the sets of vertices of the two networks. 
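
The Boolean overlay of two networks can be written directly with set operations. This is a minimal sketch; the arcs listed are hypothetical and are not taken from the figures. They were chosen only so that, as in the description above, protein 3 is known only for species A and protein 5 only for species B. Each arc is a (regulator, target) pair of orthologous-protein numerals.

```python
def vertices(arcs):
    """Set of substances (vertices) touched by a set of arcs."""
    return {v for arc in arcs for v in arc}


# Hypothetical regulatory arcs for two species.
species_a = {(1, 2), (2, 3), (2, 4)}
species_b = {(1, 2), (2, 4), (4, 5)}

arcs_or = species_a | species_b     # OR: every relationship known in either species
arcs_and = species_a & species_b    # AND: relationships shared by both species
arcs_xor = species_a ^ species_b    # XOR: relationships known in only one species

# The same operations apply to the vertex sets.
only_in_a = vertices(species_a) - vertices(species_b)
only_in_b = vertices(species_b) - vertices(species_a)

print(sorted(arcs_xor), only_in_a, only_in_b)
```

The XOR result lists the actions and substances that are represented in one species but absent in the other, which are the candidates for undiscovered orthologs.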

In some instances, orthologous networks in two distantly related taxa may have the same 
domains but the arrangement of the domains between the related taxa may be different. In 
such a case, the one-to-one correspondence between orthologous proteins found in closely 
related species has to be substituted with a one-to-many relationship among domains 
comprised within the proteins. For this purpose, a similarity object may be defined operating 
on pairs of motifs/domains in two proteins, and pairs of orthologous proteins may be 
substituted with pairs of orthologous domains. After this correction, homologous networks 
are compared as described above. 

Figure 10 is a diagram representing a hypothetical example of defining 
homologous protein networks in two different species using protein motifs, the diagram 
showing only two hypothetical proteins (lane 2) for species A and three hypothetical 
proteins (lanes 1, 3, and 4) for species B. Protein 1 in both species has motifs α and β, 
protein 2 has motifs δ, ε, and ζ, and proteins 3 and 4 have motifs δ and ζ, and ε, 
respectively. The motif analysis indicates that proteins 3 and 4 in species B may 
collectively perform the same function as protein 2 in species A. 
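
The motif comparison in this hypothetical example amounts to asking which groups of proteins in one species jointly carry exactly the motifs of a protein in the other species. The following is a minimal sketch of that check; the function name, the size limit and the spelled-out motif names are assumptions made for illustration.

```python
from itertools import combinations


def collective_equivalents(target_motifs, candidates, max_size=3):
    """Groups of candidate proteins whose combined motifs equal target_motifs.

    candidates maps a protein name to its set of motifs. Groups of up to
    max_size proteins are tried, smallest first.
    """
    target = set(target_motifs)
    hits = []
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(candidates), k):
            combined = set().union(*(candidates[name] for name in combo))
            if combined == target:
                hits.append(combo)
    return hits


# Motif compositions as described for the hypothetical example.
species_a = {"protein2": {"delta", "epsilon", "zeta"}}
species_b = {
    "protein1": {"alpha", "beta"},
    "protein3": {"delta", "zeta"},
    "protein4": {"epsilon"},
}
matches = collective_equivalents(species_a["protein2"], species_b)
print(matches)
```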

5.5.2. GENE DISCOVERY BASED ON PROTEIN 
MOTIF/DOMAIN SEARCHES 

The present invention provides yet another method for identifying genes 
that are homologous and perform the same or an analogous function in different species. 
The method of the invention comprises the following steps: (i) creating a database of 
sequences which comprise a motif or domain composition of a gene of interest using, for 
example, HMMER software; and (ii) searching additional databases for expressed 
sequence tags (ESTs) containing the domains and motifs characteristic for the gene of 
interest with HMMs of domains and motifs identified in step (i). In yet another 
embodiment of the invention, sequences may be searched which correspond to nucleotide 
sequences in an EST database or other cDNA databases using a program such as BLAST 
and retrieving the identified sequences. In an optional step, for each EST identified, 
sequence databases can be searched for overlapping sequences for the purpose of 
assembling longer overlapping stretches of DNA. Once identified, the ESTs can be used 
to isolate full length nucleotide sequences comprising the gene of interest using methods 
such as those described in Section 5.4, infra. 

The general flowchart scheme for gene discovery analysis based on 
motif/domain search is shown in Figure 11. In a specific embodiment of the invention, 
the method referred to as the "phylogenetic reflection technique" comprises, first, defining 
the motif or domain composition of a gene of interest involved in a biological system of 
interest. Second, protein-coding genes from other species, including for example yeast 
and/or nematode genes, that bear a significant similarity to the gene of interest or a 
specified domain of the corresponding protein are collected. Third, the identified genes 
are in turn subjected to a "domain analysis" to establish protein motifs which might 
suggest a function of these genes using, for example, HMMER software. Fourth, the 
selected genes are in turn used for database searches in EST databases (dbEST) and/or a 
non-redundant (nr) database to identify unknown genes that are potentially orthologous to 
the selected yeast and nematode genes. Once identified, ESTs having different tumor 
suppressor domains may be linked using multiple PCR primers. Using routine cloning 
techniques, well known to those of skill in the art, a full length cDNA representing the 
gene of interest can be obtained. 

Once new genes are identified by domain/motif analysis, experimental 
searches may be carried out to isolate complete coding sequences and evaluate their 
tissue- and disease-specific expression patterns. In parallel, their position with respect to 
regulatory networks can be identified as described below. 

In a specific embodiment of the invention, an apoptosis related human 
gene was identified using the method described above. As a first step, C. elegans genes 
containing either POZ or Kelch domains were identified. A Hidden Markov Model was 
developed using POZ and Kelch sequences from the Drosophila Kelch protein and any 
identified homologs. The resulting Hidden Markov Model was used to search through the 
collection of C. elegans protein sequences. One of the identified C. elegans genes 
contained a POZ domain, death domain, kinase domain and HEAT repeat. The presence of 
both a death domain and a kinase domain suggested that the protein functions as a 
regulatory protein. 

A human EST database was searched using the protein sequence of the 
identified C. elegans gene and two sequences were identified (Figure 14A). A gene tree 
was computed to determine whether the identified human sequences were orthologs of 
the C. elegans gene. As depicted in Figure 14B, the human EST AA481214 appears to 
be a true ortholog of the C. elegans gene. Figure 14C presents the nucleotide sequence of 
the identified death domain gene. Figure 14D presents the amino acid sequence of the 
death domain protein. 

The present invention encompasses the nucleic acid molecule of Figure 
14C, comprising the sequence of EST AA481214, and proteins encoded by said nucleic 
acid molecule. The invention also relates to nucleic acid molecules capable of 
hybridizing to such a nucleic acid molecule under conditions of high stringency. By way 
of example and not limitation, procedures using such conditions of high stringency are as 
follows: Prehybridization of filters containing DNA is carried out for 8 hours to 
overnight at 65 °C in buffer composed of 6x SSC, 50 mM Tris-HCl (pH 7.5), 1 mM EDTA, 
0.02% PVP, 0.02% Ficoll, 0.02% BSA and 500 mg/ml denatured salmon sperm DNA. 
Filters are hybridized for 48 hours at 65 °C in prehybridization mixture containing 100 
mg/ml denatured salmon sperm DNA and 5-20 x 10^6 cpm of 32P-labeled probe. Washing 
of filters is done at 37 °C for 1 hour in a solution containing 2x SSC, 0.01% PVP, 0.01% 
Ficoll and 0.01% BSA. This is followed by a wash in 0.1x SSC at 50 °C for 45 minutes 
before autoradiography. Other conditions of high stringency which may be used are well 
known in the art. 

5.5.3. SIMULATION OF REGULATORY CASCADES 
In an embodiment of the invention, an interactive graphical program is 
utilized for visualizing the scheme of regulatory relationships, "current" states of the 
substances, and active and inactive actions between pairs of substances. Such a program 
can be utilized for identification of genes which are associated with a specific disease. 
Currently, disease associated genes are discovered through positional cloning methods 
which combine methods of genetics and physical mapping with mutational analysis. The 
present invention provides a novel method for discovering disease associated genes. 

For simulating regulatory cascades, it is assumed that the time in a simulated regulatory 
system advances in discrete "quanta," or periods of time. The "state of substances" of 
the system for each discrete period of time is computed as follows. A set of substance 
objects is created for which the set of interactions between the substance objects is known, 
and an initial state is specified. The time is initially set to zero. All defined actions are 
checked to confirm that the substances corresponding to the actions (i) exist, and (ii) are in 
the right initial states. An action is defined by a pair of substances that are in suitable states. 
The "subject" substance is in the active state, while the "object" substance can be in 
either the active or the inactive state, depending on the action type. For example, the action 
"dephosphorylation" requires an active phosphatase ("subject" substance) and a 
substrate protein ("object" substance) in phosphorylated form. If both 
conditions are satisfied, the action is recorded as in progress. At termination, the 
substances must change their states as specified by the action. On each following 
"quantum" of time, the simulation proceeds in the same way while maintaining the 
"bookkeeping" of the remaining time for each action and the remaining lifespan of each 
substance. The simulation stops when there are no more active actions available. The 
program allows editing of the properties of the objects, changing the scale and focus of 
the visualized simulation, and experimenting with the system's output. 
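
The discrete-time procedure can be sketched as a small loop. This is an illustrative sketch only and not the interactive program of the invention: the class, the field names, the state labels and the example cascade are hypothetical, substance lifespans are omitted for brevity, and the required state of the "subject" substance is made a parameter so that states other than "active" can trigger an action.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    subject: str                 # substance that performs the action
    obj: str                     # substance that is acted upon
    obj_before: str              # state the object must be in for the action to start
    obj_after: str               # state the object takes when the action terminates
    duration: int                # length of the action in time quanta
    subject_state: str = "active"


def simulate(initial_states, actions, max_ticks=100):
    """Advance time in quanta until no action is in progress."""
    states = dict(initial_states)
    remaining = {}               # action name -> quanta left ("bookkeeping")
    log = []                     # (time of completion, action name)
    for t in range(max_ticks):
        # Start every action whose substances exist and are in the right states.
        for a in actions:
            if (a.name not in remaining
                    and states.get(a.subject) == a.subject_state
                    and states.get(a.obj) == a.obj_before):
                remaining[a.name] = a.duration
        if not remaining:
            break                # no more active actions: stop
        # Advance one quantum and apply state changes at termination.
        for a in actions:
            if a.name in remaining:
                remaining[a.name] -= 1
                if remaining[a.name] == 0:
                    states[a.obj] = a.obj_after
                    del remaining[a.name]
                    log.append((t + 1, a.name))
    return states, log


# Hypothetical two-step cascade.
actions = [
    Action("dephosphorylation", "phosphatase", "substrate",
           "phosphorylated", "dephosphorylated", duration=2),
    Action("activation", "substrate", "geneB", "inactive", "active",
           duration=1, subject_state="dephosphorylated"),
]
initial = {"phosphatase": "active", "substrate": "phosphorylated",
           "geneB": "inactive"}
final_states, log = simulate(initial, actions)
print(final_states, log)
```
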

In a specific embodiment of the invention, a "knock out" of a gene can be 
simulated to model the regulatory system that normally includes hypothetical gene A. 
One of the typical questions related to a gene knock out is how the knock out 
affects a biological pathway of interest. A hypothetical example of evaluating the impact 
of a knock out of hypothetical gene A on the expression of a hypothetical gene B is 
shown in Figure 12. The answer to such a question could be "gene B will be inhibited" 
or "gene B will be induced" or "no effect". 

In the practice of the present invention, a simple algorithm involving 
multiplication of gene interaction "signs" along the shortest pathway between the genes 
can be used to determine the outcome. The algorithm involves the following steps: (i) 
identification of the shortest non-oriented pathway connecting genes A and B involved in 
a pathway of interest; (ii) assigning the sign "-" to gene A since it is knocked out and taking 
this sign as the initial sign value; (iii) moving along the shortest pathway between genes 
A and B, multiplying the current value of the sign with the sign of the next arc, where "-" 
stands for inhibition, "+" stands for induction or activation, and "0" stands for the lack of 
interaction between two proteins in the specified direction; (iv) determining if the final 
result of multiplication is "0", and if so, eliminating the zero arc and trying to find the 
shortest oriented bypass pathway between A and B in the remaining network; otherwise stop. 
The final value of the sign at the moment of arriving at vertex B would indicate the most 
likely effect of the knock out of gene A which can be any one of the following: inhibition 
of gene B, induction/activation of gene B, or none. In addition to the "electronic knock 
out", an "electronic knock in" of a particular gene can be simulated. In such a computer 
simulation, the artificial addition of a gene and its effect on a regulatory system may be 
analyzed. 
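
The sign-multiplication procedure for an "electronic knock out" can be sketched as follows. This is one possible reading of steps (i) to (iv), offered as an illustration rather than as the program of the invention: the network encoding (a dictionary of directed arcs with signs +1 and -1), the function names and the example networks are hypothetical, the knocked-out gene is assumed to start with the sign "-", and the "zero arc" is taken to be the arc that exists only against the direction of travel.

```python
from collections import deque


def shortest_path(adj, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path or None."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in sorted(adj.get(u, ())):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None


def knockout_effect(arcs, a, b):
    """Predicted effect on b of knocking out a: -1 inhibited, +1 induced, 0 none.

    arcs maps (source, target) to +1 (induction/activation) or -1 (inhibition).
    """
    arcs = dict(arcs)
    oriented = False                       # first pass ignores arc direction
    while True:
        adj = {}
        for u, v in arcs:
            adj.setdefault(u, set()).add(v)
            if not oriented:
                adj.setdefault(v, set()).add(u)
        path = shortest_path(adj, a, b)
        if path is None:
            return 0                       # no pathway: no effect
        sign, zero_arc = -1, None          # knocked-out gene starts at "-"
        for u, v in zip(path, path[1:]):
            step = arcs.get((u, v), 0)     # 0: no interaction in this direction
            if step == 0:
                zero_arc = (v, u)          # the arc exists only the other way round
                break
            sign *= step
        if zero_arc is None:
            return sign
        del arcs[zero_arc]                 # eliminate the zero arc ...
        oriented = True                    # ... and look for an oriented bypass


# Hypothetical networks.
chain = {("A", "C"): -1, ("C", "B"): +1}                 # A inhibits C, C induces B
reverse_only = {("B", "A"): +1}                          # B acts on A, not vice versa
bypass = {("D", "A"): +1, ("D", "B"): +1,                # shortest link runs through D
          ("A", "C"): +1, ("C", "E"): +1, ("E", "B"): +1}

print(knockout_effect(chain, "A", "B"),
      knockout_effect(reverse_only, "A", "B"),
      knockout_effect(bypass, "A", "B"))
```

A returned value of -1 corresponds to "gene B will be inhibited", +1 to "gene B will be induced", and 0 to "no effect".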

5.6. IDENTIFICATION AND ISOLATION OF NOVEL GENES 
The present invention relates to identification of novel genes, i.e., missing 
orthologs or paralogs, and the isolation of nucleic acid molecules encoding novel genes. 
In a specific embodiment, a nucleic acid molecule encoding a missing ortholog or 
paralog can be isolated using procedures well known to those skilled in the art (see, for 
example, Sambrook et al., 1989, Molecular Cloning, A Laboratory Manual, 2d Ed., Cold 
Spring Harbor Laboratory Press, Cold Spring Harbor, New York; Glover, D.M. (ed.), 
1985, DNA Cloning: A Practical Approach, IRL Press, Ltd., Oxford, U.K., Vol. I, II). 

For example, genomic and/or cDNA libraries may be screened with 
labeled DNA fragments derived from a known ortholog or paralog from a specific species 
and hybridized to the genomic or cDNA libraries generated from a different species. For 
cross species hybridization, low stringency conditions are preferred. For same species 
hybridization, moderately stringent conditions are preferred. Any eukaryotic cell 
potentially can serve as the nucleic acid source for the molecular cloning of the gene of 
interest. The DNA may be obtained by standard procedures known in the art from cloned 
DNA (e.g., a DNA "library"), by cDNA cloning, or by the cloning of genomic DNA, or 
fragments thereof, purified from the desired cell. 

By way of example and not limitation, procedures using conditions of low 
stringency are as follows (see also Shilo and Weinberg, 1981, Proc. Natl. Acad. Sci. USA 
78:6789-6792; and Sambrook et al., 1989, Molecular Cloning, A Laboratory Manual, 2d 
Ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York): Filters 
containing DNA are pretreated for 6 h at 40 °C in a solution containing 35% formamide, 
5X SSC, 50 mM Tris-HCl (pH 7.5), 5 mM EDTA, 0.1% PVP, 0.1% Ficoll, 1% BSA, 
and 500 mg/ml denatured salmon sperm DNA. Hybridizations are carried out in the same 
solution with the following modifications: 0.02% PVP, 0.02% Ficoll, 0.2% BSA, 100 
mg/ml salmon sperm DNA, 10% (wt/vol) dextran sulfate, and 5-20 X 10^6 cpm 32P-labeled 
probe is used. Filters are incubated in hybridization mixture for 18-20 h at 40 °C, and 
then washed for 1.5 h at 55 °C in a solution containing 2X SSC, 25 mM Tris-HCl (pH 
7.4), 5 mM EDTA, and 0.1% SDS. The wash solution is replaced with fresh solution and 
incubated an additional 1.5 h at 60 °C. Filters are blotted dry and exposed for 
autoradiography. If necessary, filters are washed for a third time at 65-68 °C and 
reexposed to film. Other conditions of low stringency which may be used are well known 
in the art (e.g., as employed for cross species hybridizations). 

In another specific embodiment, a nucleic acid which is hybridizable to a
nucleic acid under conditions of moderate stringency is provided. For example, but not
by way of limitation, procedures using such conditions of moderate stringency are as
follows: filters containing DNA are pretreated for 6 h at 55 °C in a solution containing 6X
SSC, 5X Denhardt's solution, 0.5% SDS and 100 µg/ml denatured salmon sperm DNA.
Hybridizations are carried out in the same solution and 5-20 X 10^6 cpm 32P-labeled
probe is used. Filters are incubated in the hybridization mixture for 18-20 h at 55 °C, and
then washed twice for 30 minutes at 60 °C in a solution containing 1X SSC and 0.1%
SDS. Filters are blotted dry and exposed for autoradiography. Other conditions of
moderate stringency which may be used are well known in the art. Washing of filters is
done at 37 °C for 1 h in a solution containing 2X SSC, 0.1% SDS.

For expression cloning (a technique commonly used in the art), an
expression library is constructed. For example, mRNA is isolated from the cell type of
interest, cDNA is made and ligated into an expression vector (e.g., a bacteriophage
derivative) such that it is capable of being expressed by a host cell (e.g., a bacterium) into
which it is then introduced. Various screening assays can then be used to select for the
expressed gene product of interest based on the physical, chemical, or immunological
properties of its expressed product. Such properties can be deduced from the properties
of the corresponding orthologs from other species.

In another embodiment, polymerase chain reaction (PCR) can be used to
amplify the desired sequence from a genomic or cDNA library. To isolate orthologous or
paralogous genes from other species, one synthesizes several different degenerate
primers for use in PCR reactions. In a preferred aspect, the oligonucleotide primers
represent at least part of the gene comprising known ortholog or paralog sequences of
different species. It is also possible to vary the stringency of hybridization conditions
used in priming the PCR reactions, to allow for greater or lesser degrees of nucleotide
sequence similarity between the known nucleotide sequences and the nucleic acid
homolog being isolated.
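
As an illustration of what a degenerate primer represents, the following Python sketch counts and enumerates the concrete oligonucleotides encoded by a primer written with standard IUPAC nucleotide codes. The primer shown is a made-up example, and the function names are hypothetical; nothing here is taken from the primers actually used.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(primer):
    """Number of distinct sequences represented by a degenerate primer."""
    count = 1
    for base in primer.upper():
        count *= len(IUPAC[base])
    return count

def expand(primer):
    """List every concrete sequence represented by a degenerate primer."""
    choices = (IUPAC[base] for base in primer.upper())
    return ["".join(combo) for combo in product(*choices)]

# Hypothetical primer: R = A/G, Y = C/T, N = any base.
example_primer = "ATGGARYTN"
pool_size = degeneracy(example_primer)
```

A lower degeneracy generally means a more specific primer pool, which is one practical lever alongside the hybridization stringency mentioned above.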

Synthetic oligonucleotides may be utilized as primers to amplify by PCR
sequences from a source (RNA or DNA), preferably a cDNA library, of potential interest.
PCR can be carried out, e.g., by use of a Perkin-Elmer Cetus thermal cycler and a
thermostable polymerase, e.g., Amplitaq (Perkin-Elmer). The nucleic acids being
amplified can include mRNA or cDNA or genomic DNA from any eukaryotic species.
After successful amplification of a segment of the gene of interest, that segment may be
molecularly cloned and sequenced, and utilized as a probe to isolate a complete cDNA or
genomic clone.

Once identified and isolated, the gene of interest can then be inserted into
an appropriate cloning vector for amplification and/or expression in a host. A large
number of vector-host systems known in the art may be used. Possible vectors include, 
but are not limited to, plasmids and modified viruses, but the vector system must be 
compatible with the host cell used. Such vectors include, but are not limited to, 
bacteriophages such as lambda derivatives, or plasmids such as pBR322 or pUC plasmid 
derivatives or the Bluescript vector (Stratagene). The insertion into a cloning vector can, 
for example, be accomplished by ligating the DNA fragment into a cloning vector which 
has complementary cohesive termini. 

6. EXAMPLE: USE OF SPECIALIZED DATABASES 
FOR IDENTIFICATION OF NOVEL GENES 

To test the method of using databases for gene discovery, protein sequence
and domain/motif databases were developed that are specific to two overlapping
functional groupings of proteins: (i) proteins known to be tumor suppressors, and
(ii) proteins implicated in apoptosis in animals.

6.1 APOPTOSIS GENE DISCOVERY METHOD
Identification of a putative apoptosis-related human gene began with an
identification of all genes in C. elegans that contained either a POZ or kelch domain. A
subset of these genes is shown in Figure 13. Hidden Markov Models (HMMs) for the
POZ and kelch domains were built as follows. Starting with POZ and kelch sequences
from the Drosophila kelch protein (gi 1 577275), homologs were identified in other protein
sequences using the BLASTP program. The resulting sequences showing significant
similarity (e-value less than 0.001) were aligned using the CLUSTALW program, and the
alignments were used to build Hidden Markov Models with the HMMER-2 package (Krogh
et al., 1995; http://hmmer.wustl.edu/). A computer printout listing of HMM models of
tumor suppressors appears as Microfiche Appendix H to the present specification. (See
http://hmmer.wustl.edu, Chapter 2, which is incorporated by reference herein in its
entirety, for a detailed description of HMM models.)
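
The significance filter applied before alignment (e-value less than 0.001) can be sketched as follows. This is a hedged illustration, not the scripts used in the example: it assumes the search results are available as a 12-column tab-separated table in the layout produced by the current BLAST+ option `-outfmt 6` (subject identifier in column 2, e-value in column 11), and the hit names are invented.

```python
def significant_hits(tabular_text, max_evalue=1e-3):
    """Return subject identifiers whose e-value is below `max_evalue`.

    Assumes 12-column tab-separated BLAST output (subject id in column 2,
    e-value in column 11). Order of first appearance is preserved.
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject, evalue = fields[1], float(fields[10])
        if evalue < max_evalue and subject not in hits:
            hits.append(subject)
    return hits

# Invented rows for illustration only.
example_table = "\n".join([
    "kelch\tseqA\t45.0\t200\t100\t5\t1\t200\t10\t210\t2e-50\t180",
    "kelch\tseqB\t28.0\t90\t60\t3\t5\t95\t1\t90\t0.5\t30",
    "kelch\tseqC\t33.0\t150\t95\t4\t1\t150\t20\t170\t4e-04\t55",
])
kept = significant_hits(example_table)
```

The retained sequences would then be passed to the multiple alignment and model-building steps described above.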

The resulting models were used to search through a database collection of
C. elegans protein sequences. The domain structures of proteins having either a POZ or
kelch domain were identified using existing collections of protein domains (e.g., see
http://blocks.fhcrc.org/blocks/blocks_release.html, http://coot.embl-heidelberg.de/SMART/,
http://www.motif.genome.ad.jp/).
One of the unannotated protein-coding genes of C. elegans (corresponding protein
accession number gi|1132541, see Figure 11) appeared to include a POZ domain, a death
domain, a kinase domain, and a HEAT repeat. A death domain is characteristic of the
apoptosis system, and a kinase domain indicates that the protein is likely to participate in
phosphorylation of other proteins. The presence of these particular domains suggests that
this protein serves as a regulatory protein.

Using the protein sequence of gi|1132541, the database of human EST
sequences was searched and a number of partial human cDNA sequences representing
potential human orthologs or paralogs of the C. elegans gi|1132541 were identified.
The two closest human sequences, AA481214 and W51957, are depicted in Figure 14A.
To determine whether the identified human sequences were orthologs or paralogs of the
gi|1132541 gene of C. elegans, a gene tree (Saitou and Nei, 1987, Mol. Biol. Evol.
4:406-425) was computed. The gene tree was generated using homologous genes
identified with a BLASTP search against the NCBI non-redundant database, using the human
EST AA481214 sequence as a query. The resulting tree indicates that the identified
human EST AA481214 represents a true ortholog of the C. elegans gene gi|1132541
(Figure 14B). The nucleotide sequence of the death domain protein is shown in Figure
14C, and the deduced amino acid sequence is presented in Figure 14D.

6.1.2 APOPTOSIS GENE DISCOVERY METHOD
As a first step in identifying a novel gene involved in apoptosis, a
comprehensive set of articles describing the system of apoptosis/programmed cell death
in different species was compiled using the keyword "apoptosis". By analyzing the
articles, information on regulatory pathways characterizing this system in different
species, i.e., C. elegans, mouse, fruit fly, chicken, and human, was extracted. The
regulatory information was stored as a collection of schemes produced in PowerPoint
(Microsoft). Figure 4 shows a set of keywords defining proteins involved in apoptosis
pathways. The keywords were used to generate a specialized sequence database, referred
to as Apoptosis3. Using the PsiRetriever program, matching protein sequences were
extracted from the all-inclusive non-redundant GenBank protein database (NCBI) and
stored as a FASTA file. The FASTA file was then converted into a binary BLAST
database using the program FORMATDB from the BLAST suite of programs.
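
The keyword-driven extraction step can be pictured with the short Python sketch below. It is not the PsiRetriever program, whose internals are not described here; it simply assumes that selection is done by case-insensitive keyword matching against FASTA header lines. The keywords and sequences are invented for illustration and are not the keyword set of Figure 4.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def select_by_keywords(records, keywords):
    """Keep records whose header contains any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [(h, s) for h, s in records if any(k in h.lower() for k in wanted)]

def to_fasta(records, width=60):
    """Serialize records back to FASTA text with fixed-width sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Invented input for illustration only.
nr_excerpt = (
    ">p1 caspase-3 [Homo sapiens]\nMENTE\nNSVD\n"
    ">p2 actin [Mus musculus]\nMDDD\n"
    ">p3 Bcl-2 homolog\nMAHA\n"
)
specialized = select_by_keywords(parse_fasta(nr_excerpt), ["caspase", "bcl-2"])
specialized_fasta = to_fasta(specialized)
```

The resulting FASTA text is the kind of file that would then be formatted into a searchable BLAST database.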

Genomic and cDNA sequences located in the region of human
chromosome 13q were compared with the Apoptosis3 database using the BLASTALL
program from the BLAST program complex. This region of the human genome is associated
with Chronic Lymphocytic Leukemia (CLL). The comparison revealed significant
similarity between a CLL region open reading frame and the mouse RPT1 protein
(sp|P15533|RPT1) (Figure 13). Analysis of regulatory functions of RPT1 in the mouse
reveals that this gene functions as a repressor of the interleukin 2 receptor (IL-2R) gene.
When the RPT1 gene is knocked out, the regulatory effect is manifested as a block of the
apoptotic pathway in T lymphocytes resulting in an accumulation of T lymphocytes in
blood. This result is consistent with aberrations observed in CLL, namely abnormal
accumulation of B-cells in the blood (Trentin L. et al., 1997, Leuk. Lymphoma 27:35-42),
and suggests that mutations in the human RPT1 gene may play a role in the development
of CLL.



6.1.3 EXAMPLE: A DISCOVERY OF A HUMAN ORTHOLOG OF THE
MURINE MAX-INTERACTING TRANSCRIPTIONAL REPRESSOR

The family of Myc proto-oncogenes encodes a set of transcription factors
implicated in regulation of cell proliferation, differentiation, transformation and
apoptosis. c-Myc null mutations result in retarded growth and development of mouse
embryos and are lethal by day 9-10 of gestation. In contrast, overexpression of Myc genes
inhibits cell differentiation and leads to neoplastic transformation. Moreover,
deregulation of Myc expression by retroviral transduction, chromosomal translocation or
gene amplification is linked to a broad range of naturally occurring tumors in humans and
other species.

Another protein, called Max, is an obligatory heterodimeric partner for
Myc proteins in mediating their function as activators of transcription during cell cycle
progression, neoplastic transformation and programmed cell death (apoptosis). In order to
make an active transcription factor, the Myc proteins must form heterodimers with the Max
protein. This interaction with the Max protein is necessary for specific binding of Myc to
the CACGTG box (or related E-boxes) on DNA and for activation of promoters located
proximal to the binding sites.

Besides the Myc family of transcription factors, the Max protein forms
complexes with another family of so-called MAD proteins: Mxi1, MAD1, MAD3 and
MAD4. Whereas Myc:Max complexes activate transcription, MAD:Max complexes work
in the opposite way, repressing transcription through the same E-box binding sites, and
apparently antagonize Myc-mediated activation of the same set of target genes.

During tissue development a shift from Myc:Max to MAD:Max complexes
occurs coincidentally with the switch from cell proliferation to differentiation. The switch
in heterocomplexes is thought to reflect a switch from activation to repression of common
genes leading to cessation of proliferation, exit from the cell cycle and the beginning of cell
differentiation. In differentiating neurons, primary keratinocytes, myeloid cell lines and
probably other tissues, the different MAD:Max complexes appear in
sequential order during the transition from cell proliferation to differentiation. MAD3
expression appears first and is restricted to proliferating cells prior to differentiation,
where it is co-expressed with two different members of the Myc family, c-Myc or N-Myc.
Mxi1 transcripts are detected in proliferating and differentiating cells, whereas MAD1 and
MAD4 are confined to post-mitotic cells. Because Myc expression is not always
downregulated in post-mitotic cells, co-expression of Myc and MAD genes may result in
competition for Max heterodimers, thus providing a promoting or inhibitory effect on cell
proliferation.

The gene expression patterns, along with the ability of Mad proteins to
suppress Myc-dependent transformation, are consistent with a potential function of Mad
genes as tumor suppressors. This view is supported by the fact that allelic loss and
mutations were detected at the Mxi1 locus in prostate cancers (Eagle et al., 1995, Nat.
Genet. 9:249-55). Cloning of the murine proteins Mad3 and Mad4 as well as their
relation to the Max signaling network was described by Hurlin (Hurlin PJ, et al., 1995,
EMBO J. 14:5646-59) and Queva (Queva et al., 1998, Oncogene 16:967-977). Human
orthologs of Mad4, Mad1 and Mxi1 are known.

In this example, the discovery of an unknown human ortholog of the Mad3
protein, found "in silico" by means of phylogenetic analysis of known mouse and human
members of the Mad gene family and database searches, is described. Since the function
of murine Mad3 as a Max-interacting transcriptional repressor of Myc-induced neoplastic
transformation is well described, we can assign the same function to its human ortholog.
The gene tree shown in Figure 20 was constructed in the following way. The protein
sequences of known members of the Mad gene family were extracted from the GenBank
database using NCBI Entrez keyword searches. The extracted sequences were aligned
using the multiple alignment program Clustalw running on a Sun SPARC station. The quality
of the multiple alignment was checked using program Hit Viewer Iterate (A. Rzhetsky,
available upon request) and the redundant, non-homologous sequences as well as distant
homologs from S. cerevisiae, C. elegans, D. melanogaster, etc. were removed from the
alignment. The refined set of sequences was realigned with Clustalw and a gene tree as
presented in Figure 15A was computed from the alignment using program NJBOOT
(http://genome6.cpmc.columbia.edu/~andrey) running on a Sun SPARC station and
viewed with program TreeView (http://genome6.cpmc.columbia.edu/~andrey).

The tree presented in Fig. 19A clearly shows the relationships between
three known mouse genes and their two human homologs. Attempts to find a missing
human ortholog of the mouse Mad3 gene in the protein non-redundant database at NCBI
using a BLAST search did not identify any human homologs other than sequences that
were already present on the tree, confirming the absence of a known human ortholog for
the Mad3 protein in the database.

In order to identify a human ortholog of the Mad3 protein, the human
dbEST at NCBI was searched with program TBLASTN using the Mad3 protein sequence as
a query. Two ESTs were identified and are shown in Figure 17A.

Due to the nature of the dbEST database, this search produced only partial
sequences of potential candidate genes. To obtain complete coding sequences (complete
cds) of the genes, a search was conducted to obtain overlapping sequences in dbEST. The
search for overlapping sequences was performed using the program Iterate with EST
zs77e55.r1 (gb|AA278224) serving as a query. The search returned a single overlapping
sequence, namely HUMGS0012279 (dbj|C02407), thus indicating that the two EST
sequences found during the initial TBLASTN search belong to the same gene.
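
The idea of joining two overlapping ESTs into one longer sequence can be sketched as an exact suffix-prefix merge. This is a deliberately simplified illustration, not the Iterate or SeqMan algorithm: it assumes both reads are on the same strand and overlap without mismatches, and the function name and sequences are hypothetical.

```python
def merge_by_overlap(left, right, min_overlap=20):
    """Join two reads when a suffix of `left` exactly equals a prefix of `right`.

    Returns the merged sequence using the longest such overlap, or None if no
    overlap of at least `min_overlap` bases exists. Sequencing errors and
    reverse-complement reads are not handled in this sketch.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None

# Invented reads for illustration only; they share the six bases CGTTAG.
contig = merge_by_overlap("ATGGCGTACGTTAG", "CGTTAGCCATGA", min_overlap=4)
```

Production assemblers tolerate mismatches and score competing overlaps, which is why a dedicated assembly program is used for the real data.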

The complete sequence of the gene was assembled from the two ESTs using the
commercially available sequence assembly program SeqManII (DNASTAR Inc., WI). The
nucleotide sequence of the human Mad3 gene is presented in Figure 17B. The deduced
amino acid sequence of the gene is presented in Figure 17C. The translated sequence consists
of 206 amino acid residues, 81% of which are identical to the mouse Mad3 protein. The
alignment of the human and mouse Mad3 proteins was made using the BLAST server
at NCBI and is presented in Figure 17C.
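
A percent-identity figure such as the 81% quoted above can be computed from a pairwise alignment as sketched below. The denominator convention here (aligned columns without gaps) is an assumption for illustration; the text above expresses identity relative to the 206 residues of the translated sequence, so the two conventions can differ slightly. The aligned strings are invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

# Invented alignment: six ungapped columns, five of them identical.
identity = percent_identity("MKT-LLA", "MRTALLA")
```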

Multiple alignment of the new sequence with sequences of known Mad
proteins was made using Clustalw and viewed with the HitViewer. A gene tree was
computed from this alignment using NJBOOT. Multiple alignment of the new sequence
with sequences of known Mad proteins (Figure 17C) along with its position on the gene tree
(Figure 18B) shows that this new human gene found by the approach described above
belongs to the family of Mad proteins and is the ortholog of mouse Mad3.

The present invention is not to be limited in scope by the specific
embodiments described herein, which are intended as single illustrations of individual
aspects of the invention, and functionally equivalent methods and components are within
the scope of the invention. Indeed, various modifications of the invention, in addition to
those shown and described herein, will become apparent to those skilled in the art from
the foregoing description and accompanying drawings. Such modifications are intended
to fall within the scope of the appended claims.

Various publications are cited herein, the contents of which are hereby
incorporated by reference in their entireties.

WE CLAIM:

1. A method for identifying a novel nucleic acid molecule encoding a
protein of interest comprising:

(i) selecting a specific protein from a first species involved in a
regulatory network of interest;

(ii) identifying known proteins that act upstream and
downstream in the regulatory network of interest with respect
to the specific protein selected;

(iii) constructing the regulatory network of interest from the
proteins identified in step (ii);

(iv) for each identified protein, selecting a domain or motif and
searching by homology for related proteins in a second species,
wherein a related protein is defined as a protein having a
homologous domain or motif;

(v) producing a regulatory network for the second species,
wherein said regulatory network incorporates the identified
related proteins;

(vi) comparing the regulatory network from the first species to
the regulatory network of said second species;

(vii) identifying a protein present in a regulatory network for one
species but absent in the regulatory network of the other
species; and

(viii) isolating a nucleic acid molecule encoding the protein
identified in step (vii) in the species in which it is missing.

2. The method of claim 1 wherein the nucleic acid molecule encodes a
human protein.

3. The method of claim 1 wherein the related proteins are orthologs.

4. The method of claim 1 wherein the regulatory pathway is involved in
apoptosis.

5. The method of claim 1 wherein the specific protein from the first
species is involved in tumor suppression.

6. A method for identifying the effect of a gene knockout on a regulatory
pathway comprising the following steps:

(i) identification of the shortest non-oriented pathway
connecting two gene products;

(ii) assigning an initial sign value of "-" to the knockout since the
knockout gene product is inactive;

(iii) moving along the shortest pathway between the two gene
products multiplying the sign with the sign of the next gene
product in the pathway, wherein "-" stands for inhibition, "+"
stands for induction or activation, and "0" stands for the lack
of interaction between two proteins in the specified direction;
and

(iv) determining the final sign at the end of the pathway, wherein
"-" indicates inhibition and "+" indicates induction or
activation of the pathway.

7. A method for identifying a novel nucleic acid molecule encoding a
protein of interest comprising:

(i) selecting a gene of interest and searching a database for
homologous sequences;

(ii) aligning the homologous sequences identified in step (i);

(iii) constructing a gene tree using the sequence alignment;

(iv) constructing a species tree;

(v) inputting the species tree and gene tree into an algorithm
which integrates the species tree and the gene tree into a
reconciled tree; and

(vi) identifying orthologous genes present in one species but
missing in another.



8. The method of claim 7 wherein the following algorithm is used to
integrate the species tree and the gene tree into a reconciled tree:

(i) computing the similarity σ(S_gi, S_sj) for each pair of interior
nodes from trees T_g and T_s;

(ii) finding the maximum σ(S_gi, S_sj);

(iii) saving S_gi as a new cluster of orthologs, and saving {S_gi} - {S_sj} as
a set of species that are likely to have a gene of this kind (or to have
lost it in evolution);

(iv) eliminating S_gi from T_g: T_g := T_g \ S_gi; and

(v) repeating steps (ii)-(iv) until T_g is empty.
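
The greedy loop recited in claim 8 can be illustrated with the Python sketch below. Several points are assumptions made only for this illustration: interior nodes are represented as sets (gene labels for the gene tree, species labels for the species tree), the similarity measure is taken to be the Jaccard index because no formula is given in this passage, and the recorded species set is interpreted as the members of the best-matching species cluster that have no gene in the ortholog cluster. All names and the toy Mad-family data are hypothetical.

```python
def jaccard(a, b):
    """Assumed similarity measure: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def extract_ortholog_clusters(gene_clusters, species_clusters, species_of):
    """Greedily peel ortholog clusters off the gene tree.

    gene_clusters: iterable of sets of gene labels (interior nodes of T_g)
    species_clusters: iterable of sets of species labels (interior nodes of T_s)
    species_of: dict mapping each gene label to its species
    """
    species_clusters = [set(s) for s in species_clusters]
    if not species_clusters:
        raise ValueError("at least one species cluster is required")
    remaining = [set(c) for c in gene_clusters if c]
    results = []
    while remaining:  # stop once the gene tree has been emptied
        best = None
        for genes in remaining:
            gene_species = {species_of[g] for g in genes}
            for species in species_clusters:
                score = jaccard(gene_species, species)
                if best is None or score > best[0]:
                    best = (score, genes, species, gene_species)
        score, genes, species, gene_species = best
        results.append({
            "orthologs": sorted(genes),
            "missing_species": sorted(species - gene_species),
            "similarity": score,
        })
        removed = set(genes)
        remaining = [c - removed for c in remaining]
        remaining = [c for c in remaining if c]
    return results

# Toy data loosely modeled on the Mad family example (labels are invented).
species_of = {"mMad1": "mouse", "hMad1": "human", "mMad3": "mouse",
              "mMad4": "mouse", "hMad4": "human"}
gene_clusters = [{"mMad1", "hMad1"}, {"mMad3"}, {"mMad4", "hMad4"},
                 set(species_of)]
clusters = extract_ortholog_clusters(gene_clusters, [{"mouse", "human"}], species_of)
```

On this toy input the last cluster contains only the mouse Mad3 gene and reports human as the species lacking a known member, which mirrors the kind of gap the method is meant to expose.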

9. A method for identifying a novel gene comprising the following
steps:

(i) defining a motif or domain composition of a gene of interest;

(ii) searching for sequences which correspond to nucleotide
sequences in an expression sequence tag database or other
cDNA databases using a program such as BLAST and
retrieving the identified sequences;

(iii) searching additional databases for expressed sequence tags
containing the domains and motifs characteristic for
the gene of interest with Hidden Markov Model of domains
and motifs identified in step (i);

(iv) identifying nucleotide sequences comprising the gene of
interest.

overlapping sequences for the purpose of assembling longer
overlapping stretches of DNA.

11. A method for extracting information on interactions between
biological entities from natural-language text data, comprising:

(i) parsing the text data to determine the grammatical structure of the
text data; and

(ii) regularizing the parsed text data to form structured word terms.

12. The method according to claim 11, further comprising preprocessing
the data prior to parsing, with preprocessing comprising the step of identifying biological
entities.

13. The method according to claim 11, further comprising referring to an
additional parameter which is indicative of the degree to which subphrase parsing is to be
carried out.

14. The method according to claim 11, wherein said parsing step further
comprises segmenting the text data by sentences.

15. The method according to claim 11, wherein said parsing step further
comprises:

segmenting the text data by sentences; and

segmenting each of the sentences at identified words or phrases.

16. The method according to claim 11, wherein said parsing step further
comprises:

segmenting the text data by sentences; and

segmenting each of the sentences at a prefix.

17. The method according to claim 11, wherein said parsing step further
comprises skipping undefined words.

18. The method according to claim 11, wherein said parsing step further
comprises:

identifying one or more binary actions and their relationships; and

identifying one or more arguments associated with the actions.

19. The method according to claim 11, further comprising performing
error recovery when parsing of the text data is unsuccessful.

20. The method according to claim 19, wherein said error recovery step
comprises:

segmenting the text data; and

analyzing the segmented text data to achieve at least a partial parsing of the
unsuccessfully parsed text data.

21. The method according to claim 11, wherein said tagging step
comprises providing the structured data component in a Standard Generalized Markup
Language (SGML) compatible format.

22. A computer system for extracting information on biological entities
from natural-language text data, comprising:

(i) means for parsing the natural-language text data; and

(ii) means for regularizing the parsed text data to form structured word
terms.

23. The system according to claim 22, further comprising means for
preprocessing the data prior to parsing, with the preprocessing means comprising
identifying biological entities.

24. The system according to claim 22, further comprising means for
referring to an additional parameter which is indicative of the degree to which subphrase
parsing is to be carried out.

25. The system according to claim 22, wherein said parsing means
further comprises means for segmenting the text data by sentences.

26. The system according to claim 22, wherein said parsing means
further comprises:

means for segmenting the text data by sentences; and

means for segmenting each of the sentences at identified words or phrases.

27. The system according to claim 22, wherein said parsing means
further comprises:

means for segmenting the text data by sentences; and

means for segmenting each of the sentences at a prefix.

28. The system according to claim 22, wherein said parsing means
further comprises means for skipping undefined words.

29. The system according to claim 22, wherein said parsing means
further comprises:

means for identifying one or more binary actions and their relationships; and

means for identifying one or more arguments associated with the actions.

30. The system according to claim 22, further comprising means for
performing error recovery when parsing of the text data is unsuccessful.

31. The system according to claim 22, wherein said error recovery
means comprises:

means for segmenting the text data; and

means for analyzing the segmented text data to achieve at least a partial
parsing of the unsuccessfully parsed text data.

32. The system according to claim 22, wherein said tagging means
comprises means for providing the structured data component in a Standard Generalized
Markup Language (SGML) compatible format.

ABSTRACT OF THE INVENTION

The present invention relates to methods for identifying novel genes
comprising: (i) generating one or more specialized databases containing information on
gene/protein structure, function and/or regulatory interactions; and (ii) searching the
specialized databases for homology or for a particular motif and thereby identifying a
putative novel gene of interest. The invention may further comprise performing
simulation and hypothesis testing to identify or confirm that the putative gene is a novel
gene of interest. The present invention also relates to natural language processing and
extraction of relational information associated with genes and proteins that are found in
genomics journal articles. To enable access to information in textual form, the natural
language processing system of the present invention provides a method for extracting and
structuring information found in the literature in a form appropriate for subsequent
applications.



A31869A (SHEET 1)

FIGURE 1

[Flowchart of the method. Panel titles: Generation of specialized databases; Sequence
analysis; Simulation/hypothesis testing. Box labels: Group genes/proteins by their
regulatory effect (e.g., tumor suppressors), one of the actions (e.g., DNA-binding), or
system (e.g., apoptosis); Make specialized sequence database; Make specialized
motif/domain database; Generate persistent database objects corresponding to molecules
and interactions among them; Supply protein and gene objects with sequences retrieved
from GenBank; Define motifs/domains for proteins; Generate a Hidden Markov Chain Model
for each motif/domain; Align homologous sequences; Compute phylogenetic trees, define
orthologous and paralogous genes; Identify "missing" (undiscovered) genes; Extract and
compare regulatory networks for several species; Identify "orthologous" and "paralogous"
regulatory networks; Simulate regulatory events, gene knockouts and "knockins".]



FIGURE 2

[Block diagram of the natural language processing system. Box labels: Text Article;
Preprocessor; Tagging; Grammar; Parser; Recovery; Structured Form.]




FIGURE 3

[Object model diagram. Legible classes and attributes:
Protein: Name(s), Taxonomy, NCBI.id, Action(s), Motif(s), ProteinCurrentState.
Gene: Name(s), Taxonomy, NCBI.id, GeneCurrentState.
RNA: Name(s), Taxonomy, NCBI.id, Action(s), Motif(s), RNACurrentState.
ProteinState: Modification(s), Tissue, Subcellular, IsExpressed, RemainingLifeTime, IsActive.
GeneState: Modification, Tissue, IsTranscribed, IsTranslated, IsActive, CurrentExpressionRate.
RNAState: Modification(s), Tissue, Subcellular, RemainingLifeTime, IsActive.
Complex: Name(s), ComplexCurrentState.
ComplexState: Tissue, Subcellular, RemainingLifeTime, IsActive.
SmallMoleculeState: Modification(s), Tissue, Subcellular, RemainingLifeTime.
Action: ObjectSubstance, SubjectSubstance, IsSymmetrical, ObjectInitialState,
ObjectFinalState, SubjectInitialState, SubjectFinalState, Result.
Similarity: SubstanceA, SubstanceB, SegmentOfA, SegmentOfB, DegreeOfSimilarity.
Action subtypes, each with a SubjectSite(s)Affected attribute, include Bind, Release,
Modify and Truncate; the remaining subtype labels are illegible.]



FIGURE 4



FIGURE 5

[Diagram showing a gene tree and a species tree. Legend: two node symbols distinguish
duplication due to speciation from intraspecies duplication.]




FIGURE 6

[Gene tree of aldehyde dehydrogenase (ALDH) and related sequences (MMSDH, BADH, SSDH,
GABDH) from bacterial, fungal, plant and animal species; leaves are labeled with
accession number, enzyme name and species. Bracketed clusters: MMSDH; ALDH4; ALDH11;
Plant betaine ALDH; Fungus/Plant/Animal ALDH1/2/5/6; ALDH3/7/8/10; ALDH9. Scale bar:
0.25.]

(SHEET h OF 53 ) 



[FIGURE (number illegible): the ALDH phylogenetic tree redrawn with bootstrap support values at the internal nodes; scale bar 0.20. It shows largely the same sequences as FIGURE 6, with P49419 antiquitin (human) added. Clade labels: ALDH2; ALDH6; Fungus/Plant/Animal ALDH1/2/5/6; Plant betaine ALDH; ALDH9; Bacteria/Protozoan/Animal ALDH3/7/8/10; SSDH; ALDH4; antiquitin; MMSDH.]



[FIGURE (number illegible): flowchart of the gene discovery procedure. Boxes in the order printed:]

Start with a single biological system, with a single gene, or with a gene family.
Reconstruct a "network" of interacting genes and proteins.
Identify a set of key domains and motifs.
Search for related motifs in databases of known organisms.
Identify members of multigene families.
Compute phylogenetic trees.
Identify clusters of paralogous genes; identify paralogous and orthologous networks.
Compare regulatory schemes; identify genes that are known in one system but missing in another.
Find the genes using experimental techniques.

Labels on the network sketches: Gene; Paralogous networks; Missing network; Paralogous networks in human.
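The comparison step of the flowchart (identify genes that are known in one system but missing in another) can be sketched as follows. This is a minimal illustration only, not the inventors' program: the function name `find_missing_genes`, the dictionary-based network representation, and the gene names in the example are all hypothetical.

```python
def find_missing_genes(reference_network, target_genes, ortholog_of):
    """Report reference genes that have no known counterpart in the target system.

    reference_network: dict mapping each gene of the well-studied system to the
        set of genes it interacts with (assumed representation).
    target_genes: set of genes already known in the second system.
    ortholog_of: dict mapping a reference gene to its known counterpart in the
        second system; genes without a known counterpart are simply absent.

    Returns a dict: missing reference gene -> sorted list of target-system genes
    that the undiscovered counterpart would be expected to interact with.
    """
    candidates = {}
    for gene, partners in reference_network.items():
        if ortholog_of.get(gene) in target_genes:
            continue  # counterpart already known, nothing to discover
        expected_partners = sorted(
            ortholog_of[p] for p in partners if ortholog_of.get(p) in target_genes
        )
        candidates[gene] = expected_partners
    return candidates


# Hypothetical example: geneC has no known counterpart in the second system.
reference_network = {
    "geneA": {"geneB"},
    "geneB": {"geneA", "geneC"},
    "geneC": {"geneB", "geneD"},
    "geneD": {"geneC"},
}
target_genes = {"hA", "hB", "hD"}
ortholog_of = {"geneA": "hA", "geneB": "hB", "geneD": "hD"}

print(find_missing_genes(reference_network, target_genes, ortholog_of))
```

In this toy case the output names geneC as the gap, together with the known target-system genes (hB and hD) that its counterpart should interact with; such a candidate would then be sought experimentally, as the last box of the flowchart states.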



[FIGURE (number illegible): no text recoverable from this sheet.]

[FIGURE (number illegible): labels "Species A" and "Species B".]

[FIGURE (number illegible): no text recoverable from this sheet.]



[FIGURE (number illegible): flowchart of the motif-based search. Boxes in the order printed:]

Start with a biological system.
Collect proteins/genes related to this system.
Identify the set of key domains/motifs.
Compile a sequence alignment for each domain/motif.
Train one of the protein motif search algorithms to identify these motifs in protein sequences.
Search for related motifs in the human EST database.
Search for related motifs in the yeast and nematode genomes, then compare the identified unannotated genes with the human EST database.
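The "compile an alignment, train a motif search, scan sequences" steps can be illustrated with a position weight matrix. This is a sketch under stated assumptions: the flowchart does not name the search algorithm, so a simple log-odds matrix with a uniform background is substituted here for illustration, and the function names, the pseudocount, the threshold and the toy alignment are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(alignment, pseudocount=1.0):
    """Build a log-odds matrix from gap-free aligned motif instances of equal length."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned motif instances must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (an assumption)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        pwm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, score) for every window whose summed log-odds score reaches the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(aa not in AMINO_ACIDS for aa in window):
            continue  # skip windows containing ambiguous residues such as X
        score = sum(pwm[i][aa] for i, aa in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy alignment of three hypothetical motif instances.
pwm = build_pwm(["CPICL", "CPVCR", "CPICC"])
for start, score in scan("MKTCPICLEE", pwm, threshold=5.0):
    print(start, round(score, 2))
```

The same `scan` call would be applied to each translated EST or genome sequence in turn; a profile hidden Markov model or an iterated BLAST search could replace the matrix without changing the surrounding procedure.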




gi|2274880 

£12274882, gjl2274884, 
gi 12276170 

gi|1397285 




gi [2414340 



gi 1 12] 61 23, gi|732215 ) git2226406, gi (1353 147, gijl070084, git!71 1486 




giJ1914353 

gi|2315789,gi|23157S5,gi|974791, 
gi|1938433,gi|2315784, gi|1707170, 
gi|2429497, gj| 104 1322, gif1707181, 
gi (2429422, gi [2315568, gi£3 15783 

gi|1465S36, ^1707203, gif2315752, 
gi}465779, gif 1 707205, gi| 24294 93, 
£[1707215, gi|!O7O062, gi|231Jd55 



gtf2291257 




gi|2394485 



132514 



.gi|2497016 - ; - • : 

gjjl914354, #11293841, gJ|132636<V . 
giJ132636J (B0496.2Xv\lQbl93G r 
gt}23 15543, gi] WW 9377, gi|1049378, 
' gj|l 132541, gj!1903J02,gi|n22793. 
' ^086886,^11729689, g)|i2564>6, 
giT143S723..grin5W51 i gjji4293T2 

"g32384930 



• £[2313652.. ' 




£[2315779, £{2315571, 
gi!465778 

git2315569, gjf 1903070 

£[2429533, gi|23 15567, £[1903069, 
gj|465837,£}242954] 

gip29718,giil526968 (MEL26), 
£[466032, £[1176717 

£U707217,£|n07213, £[1707214, 
gi [2315750, £[2315635, £|1707216, 
gt[2315748, £[1707212, £[1707202, 
£[2315636, g)[23 15655, p|23 15634, 
£(2315541, £[231 5634 

gi J868172 
gi[23 15751 

£[1707204, gi|l 707206, £[2315660, 
gi|23 15661, gi|2315757 (POZ is truncated) 



h £{2315£?a(POZxs:truncate6V 
..^12315788; g^2429424 ■ 



Legend of domain symbols (the symbols themselves are not recoverable from the scan):

POZ/BTB domain
Kelch repeat
ring finger domain
fibronectin III domain
cyclin repeat
EGF-like domain
CUB domain
laminin EGF-like domain
transmembrane helix
BP - bipartite nuclear localization signal
transferase domain
new A-domain
new POZ-[illegible] domain
new B-domain
new SPOP domain
new(?) TN (Tumor Necrosis) domain
fos/jun DNA-binding domain
gi|1707204-domain
new HAT domain (Hemagglutinin, Alpha toxin, Tumor necrosis-factor-alpha-induced protein)
death domain
kinase domain
Zn-finger C2H2
Proline-rich region
PKC-C1 domain, DAG/PE-binding
PKC-C2 domain
protein kinase domain

Figure (number illegible)



>gi|2210766|gb|AA481214|AA481214 aa34e02.r1 NCI_CGAP_GCB1 Homo sapiens cDNA clone IMAGE:815162 5' similar to WP:W07A12.4 CE03795 ;, mRNA sequence [Homo sapiens]
CATGGCTTCCTGGACACCAACCCTGCCATCCGGGAGCAGACGGTCAAGTCCATGCTGCTCCTGGCCCCAA
AGCTGAACGAGGCCAACCTCAATGTGGAGCTGATGAAGCACTTTGCACGGCTACAGGCCAAGGATGAACA
GGGCCCCATCCGCTGCAACACCACAGTCTGCCTGGGCAAAATCGGCTCCTACCTCAGTGCTAGCACCAGA
CACAGGGTCCTTACCTCTGCCTTCAGCCGAGCCACTAGGGACCCGTTTGCACCGTCCCGGGTTGCGGGTG
TCCTGGGCTTTGCTGCCACCCACAACCTCTACTCAATGAACGACTGTGCCCAGAAGATCCTGCCTGTGCT
CTGCGGTCTCACTGTAGATCCTGAGAAATCCGTGCGAGACCAGGCCTTCAAGGCA

>gi|1349211|gb|W51957|W51957 zc45f01.r1 Soares_senescent_fibroblasts_NbHSF Homo sapiens cDNA clone IMAGE:325273 5', mRNA sequence [Homo sapiens]
CCTTCGAGTTCGGCAATGCTGGGGCCGTTGTCCTCACGCCCCTCTTCAAGGTGGGCAAGTTCCTGAGCGC
TGAGGAGTATCAGCAGAAGATCATCCCTGTGGTGGTCAAGATGTTCTCATCCACTGACCGGGCCATGCGC
ATCCGNCTCCTGCAGCAGATGGAGCAGTTCATCCAGTACCTTGACGAGCCAACAGTCAACACCCAGATCT
TCCCCCACGTCGTACATGGCTTCCTGGACACCAACCCTGCCATCCGGGAGCAGACGGTCAAGTCCATGCT
GCTCCTGGCCCCAAAGCTGAACGAGGCCAACCTCAATGTGGAGCTGATGAAGCACTTTGCACGGCTACAG
GCCAAGGATGAACAGGGCCCCATCCGCTGCAACACCACAGTCTGCCTGGGCAAAATCGGCTCCTACCTCA
GTGCTAGCACCAGACACAGGGTCCTTACCTCTG



[FIGURE (number illegible): phylogenetic tree with the leaves H.sapiens, C.elegans_e1350092, S.pombe_013733, S.cerevisiae_S60992 and Nicotiana tabacum e244568; bootstrap values 100 and 85; scale bar 0.15.]



BASE COUNT      405 a    545 c    493 g    278 t      6 others
ORIGIN
        1 cagccgaagc amgcaaaaat tcttccagga gctgagcaag agcctggacg cattccctga
       61 ggayttctgt cggcacaagg tgctgcccca gctgctgacc gccttcgagt tcggcaatgc
      121 tggggccgtt gtcctcacgc ccctcttcaa ggtgggcaag ttcctgagcg ctgaggagta
      181 tcagcagaag atcatccctg tggtggtcaa gatgttctca tccactgacc gggccatgcg
      241 catccgcctc ctgcagcaga tggagcagtt catccagtac cttgacgagc caacagtcaa
      301 cacccagatc ttcccccacg tcgtacatgg cttcctggac accaaccctg ccatccggga
      361 gcagacggtc aagtccatgc tgctcctggc cccaaagctg aacgaggcca acctcaatgt
      421 ggagctgatg aagcactttg cacggctaca ggccaaggat gaacagggcc ccatccgctg
      481 caacaccaca gtctgcctgg gcaaaatcgg ctcctacctc agtgctagca ccagacacag
      541 ggtccttacc tctgccttca gccgagccac tagggacccg tttgcaccgt cccgggttgc
      601 gggtgtcctg ggctttgctg ccacccacaa cctctactca atgaacgact gtgcccagaa
      661 gatcctgcct gtgctctgcg gtctcactgt agatcctgag aaatccgtgc gagaccaggc
      721 cttcaaggcm wttcggagct tcctgtccaa attggagtct gtgtcggagg acccgaccca
      781 gctggaggaa gtggagaagg atgtccatgc agcctccagc ccfcggcatgg gaggagccgc
      841 agctagctgg gcaggctggg cgtgaccggg gtctcctcac tcacctccaa gctgatccgt
      901 tcgcacccaa ccactgcccc aacagaaacc aacattcccc aaagacccac gcctgaagga
      961 gttcctgccc cagcccccac ccctgttcct gccaccccta caacctcagg ccactgggag
     1021 acgcaggagg aggacaagga cacagcagag gacagcagca ctgctgacag atgggacgac
     1081 gaagactggg gcagcctgga gcaggaggcc gagtctgtgc tggcccagca ggacgactgg
     1141 agcaccgggg gccaagtgag ccgtgctagt caggtcagca artccgacca caaatcctcc
     1201 aaatccccag agtccgactg gagcagctgg gaarctgagg gcrtccrtggga acagggctgg
     1261 caggagccaa gctcccagga gccacctyct gacggtacac ggcrtggccag cgagtataac
     1321 tggggtggcc cagagtccag cgacaagggc gaccccttcg cta<:cctgtc tgcacgtccc
     1381 agcacccagc cgaggccaga ctcttggggt gaggacaact gggagggcct cgagactgac
     1441 agtcgacagg tcaaggctga gctggoccgg aagaagcgcg aggagcggcg gcgggagatg
     1501 gaggccaaac gcgccgagag gaaggtgcca agggccccat gaagctggga gcccggaagc
     1561 tggactgaac cgtggcggtg gcccttcccg gctgcggaga gcccgcccca cagatgtatt
     1621 tattgtacaa accatgtgag cccggccgcc cagccaggcc atctcacgtg tacataatca
     1681 gagccacaat aaattctatt tcacaaaaaa aaaaaaaaaa aaaaaaa
//

Figure (number illegible)



        5    10    15    20    25    30
  1 SRSXQ KFFQE LSKSL DAFPE DFCRH KVLPQ
 31 LLTAF EFGNA GAVVL TPLFK VGKFL SAEEY
 61 QQKII PVVVK MFSST DRAMR IRLLQ QMEQF
 91 IQYLD EPTVN TQIFP HVVHG FLDTN PAIRE
121 QTVKS MLLLA PKLNE ANLNV ELMKH FARLQ
151 AKDEQ GPIRC NTTVC LGKIG SYLSA STRHR
181 VLTSA FSRAT RDPFA PSRVA GVLGF AATHN
211 LYSMN DCAQK ILPVL CGLTV DPEKS VRDQA
241 FKAXR SFLSK LESVS EDPTQ LEEVE KDVHA
271 ASSPG MGGAA ASWAG WA
H P 



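The relationship between the nucleotide listing and the amino acid listing above can be checked by conceptual translation. The sketch below assumes the standard genetic code and a reading frame that starts at the second base of the listing (an inference from comparing the two listings, not something the drawings state); the function name `translate` is hypothetical. Codons containing an ambiguity code such as m, y or w are reported as X.

```python
from itertools import product

BASES = "tcag"
# Standard genetic code, codons enumerated with the first base varying slowest
# in the order t, c, a, g.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate DNA starting at the 0-based offset `frame`.

    Spaces are ignored; any codon not made solely of a, c, g, t becomes 'X'.
    A trailing partial codon is dropped.
    """
    dna = dna.lower().replace(" ", "")
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )


# First row of the nucleotide listing (positions 1 to 60).
row1 = "cagccgaagc amgcaaaaat tcttccagga gctgagcaag agcctggacg cattccctga"
print(translate(row1, frame=1))
```

For this first row the translation agrees with the first nineteen residues of the amino acid listing, including the X at the position of the ambiguous codon amg.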



>sp|P15533|RPT1_MOUSE DOWN REGULATORY PROTEIN
OF INTERLEUKIN 2 RECEPTOR (J03776) rpt-1r [Mus
musculus] Length = 353

Score = 92.0 bits (237), Expect = 6e-20



Query 194 VMELLEEDLTCPICCSLFDDPRVLPCSHNFCKKCLEGILEGSVRNSMWRPAPFKCPTCRK 373 

V+E+++E++TCPIC L +P C+H4FC+ C+ E S RN+ CP CR 

Sbjct 5 VLEMIKEEVTCPICLELLKEPVSAIXNHSFCRACITLNYE-SNROT DGKGNCPVCRV 60 

Query 374 ETSATGINSLQVWYSLKGIVEKYNKIKISP KMPVCKGHMGQPLNI FCLTDMQLICG 541 

+L+ N + IVE+ K P K+ +C H G+ L +FC DM +IC 
Sbjct 61 PYP FGNLRPNLHVAN IVERLKGFKS I P EEEQKVN1 CAQH- GEKLRLFCRKDMMVI CW 116 

Query 542 I CATRGEHTKHVTCSI EDAYAQERDAFESLFQSF ETWRRGDALSRLDTMETSK 700 

+C EH H IE+ + ++ + + W- L R+D 
Sbjct 117 LCERSQEHRGHQTALI EEVDQE YKEKLQGJU^WKLMKKAKE CDEWQDDLQLQRVDW 171 

Query 701 RKS LQLMTKDS DKVKE FFEKLQHTLDQKKNE I LS D FETMKLAVMQAYDPE I NKL 8 62 

+Q+ + + V+ F+ L+ LD K+NE L + K VM+ + N+L 
Sbjct 172 ENQIQI NVENVQRQFKGLRDLLDSKENEEWKLKXEKKEVMEKLEESENEL 222 



Homology covers ring finger, B-box and the beginning of coiled coil domain 
in the CLL ring finger protein 






Activated CD4+ T-cells

Rpt1 (represses expression of IL-2 receptor)

IL-2 receptor → normal expression of Bcl2

IL-2, IL-15

normal apoptosis

When rpt1 is knocked out:

IL-2, IL-15

overexpression of Bcl2

apoptosis



TBLASTN 2.0.6 [Jan-05-1999]

Reference:
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= gi|2137498|Mad3m
(205 letters)

gb|AA276224|AA278224 zs77e05.r1 NCI_CGAP [remainder of the description illegible]

Length = 430
Score = 209 bits (526), Expect = 1e-53
Identities = [illegible], Positives = [illegible] (92%), Gaps = 1/124 (0%)

ouery: 1 ^i 1 ^ *° 

sb3 =t: 56 --HIQVLLQ^S^ 235 

Query: 61 ™|!™^ llg 

-*e, 23 e == 415 

Query: 120 EKLRS 124 
E+LR4 

Sbjct: 416 ERLRT 430 

dbj|C02407|C02407 HUMGS0012279 Human Gene Signature, 3'-directed cDNA sequence.
Length = 348

Score = 97.5 bits (239), Expect = 6e-20
Identities = [illegible] (80%), Positives = 56/63 (87%)
Frame = [illegible]

Query: 125 j^^^^^^^^^^^ERERLRADSLDSSGLSSERSI)SDQEDLEVEWNLVreTETE 184 

S b >c t : 45 ^X^^^ , 

Query: 185 LLQ 187 
LL+ 

Sbjct: 225 LLR 233 






BASE COUNT      130 a    234 c    258 g    106 t      5 others
ORIGIN
        1 cagccgcttg ctccggccgg caccctaggc cgcagtccgc caggctgtcg ccgacatgga
       61 acccttggcc agcaacatcc aggtcctgct gcaggcggcc gagttcctgg agcgccgtga
      121 gagagaggcc gagcatggtt atgcgtccct gtgcccgcat cgcagtccag gccccatcca
      181 caggaggaag aagcgacccc cccaggctcc tggcgcgcag gacagcgggc ggtcagtgca
      241 caatgaactg gagaagcgca ggagggccca gttgaagcgg tgcctggagc ggctgaagca
      301 gcagatgccc ctgggcggcg actgtgcccg gtacaccacg ctgagcctgc tgcgccgtgc
      361 caggatgcac atccagaagc tggaggatca ggagcagcgg gcccgacagc tcaaggagag
      421 gctgcgcaca aagcagcaga gcctgcagcg gcantggatg cagctccggg ggctggcagg
      481 ngcggccgag cgggagcgnc tgcgggcgga cagtctggac tcctcaggcc tctcctctga
      541 gcgctcagac tcagaccaag aggagctgga ggtggatgtg gagagcctgg tgtttggggg
      601 tgaggccgag ctgctgcggg gcttcgtcgc cggccaggag cacagctact cgcacgtcgg
      661 cggcgcctgg ctatgatgtt cctcacccan ggcgggcctc tgccctctta ctcgttgccc
      721 aagcccactt tnc



C 



>Mad3h (Putative)

MEPLASNIQVLLQAAEFLERREREAEHGYASLCPHRSPGPIHRRKKRPPQAPGAQDSGRSVHNELEKRRRAQLK
RCLERLKQQMPLGGDCARYTTLSLLRRARMHIQKLEDQEQRARQLKERLRTKQQSLQRXWMQLRGLAGAAERER
LRADSLDSSGLSSERSDSDQEELEVDVESLVFGGEAELLRGFVAGQEHSYSHVGGAWL



gi|2506888|MADm
gi|729978|MADh
gi|2792362|Mad4h
gi|2137499|Mad4m
gi|2137498|Mad3m
Mad3h Putative

gi|2506888|MADm
gi|729978|MADh
gi|2792362|Mad4h
gi|2137499|Mad4m
gi|2137498|Mad3m
Mad3h Putative



KATAVGMH I QLLLEAADY 
HAAAVRMNI QTILLEAADY 

KELNSLLI LLEAAEY 

MELNS LLLLLEAAEY 

-ME PVAS N I QVL LQAAE F 
-MEPLASNIQVLLQAAEF 



LERRIPXAEHGYASMLPYS-KDRDAFKRRKKPKKNST — SSR5THKt>5EXKRPAHLRLCLSKLKGLVPLGPESSRHTTLS LL 
LERJLEREAEHCSVASMLPVNNKDRDALKRRNKSKKNNS — SSRSTHHEMEKNRRAHLRLCLEKIJCGLVPLGPESSRHT/TLSLi. 
LERRDREAEHGVASVLPPDGDFAREKTKAAGLVRXAP — NNRSSHNELEKHRRAKLRLYLEQLKQLVPLGPDSTRHTTLSLL 
XERRDREAEHGYASMLPFDGDFAJUQCTKTAGLVRKGF — NNRSSHNELEKHRRAKLRLYLEQUCQLGPLGPDSTRHTTLSLL 
LERREREAEBGVAS LCPHKSPGTVCRRRKPPl^APGAI^SGRSVHNELEKRJUUkQL^ LL 
LEPilERIAEHGYASLCPHRSPGPIHRRKXRPPQAPGAQ 

TKAKLK • KXLEDOJRKAVHQIDQLQREQRHLKRRLEKLGAERTR MDSVG-SWS S ERSDSDREE LDVDVDVD VD VDVEG TD V LK3DLGWS S £ - 

TKAK LH I KK L E DCD RKA VHQ IDQLQREQRH LKRQLEKLG I E RI R KBS IG-STVSS ERSDSDRE E I DVDVES TDY LTGDLDHS S S S 

KRAJryHIKKLEEQDRRALSIKECLOOEHRFLKRRLEQLSVQSVER VRTDS TG-S AVS TD — DS EQE VDI EGMEFGPGELDSVGS- 

K-AKMHIKKLEEGDRRALSIKEQLQREHRTIJCRRLEQLSVQSVR VRTDS TG-S AVS TD — DS EQE VD I EGME FGPGE LDSAGS - 

R-ARVH1 QKLEE QE OQA RRLKE KL RS KQQS LQQQLEQLQG LPGAAE RE RLRADS LDSS GLS S E RS DS DQE DLEVDVENLVFG-TETELLQSF 

RRARM H I QK L ED QE QRA RQ LKE RL RT KQQS LQ RXWMQL RG LAGAAE RE RL RADS LD5 S GLS S E RSDS DQE ELEVDVES LVFG-GEAELLRGF 



gi|2506888|MADm V5DSDERGSMQS LG-S DEGVSSArvTCRAKLQDGHKAGUJL

gi|729978|MADh V5 DS DERGSMQS I^-SDEGYSSTS-HfRIKLQDSHKACLGl.

gi|2792362|Mad4h S5DADDHYSLQSGTGGDSGFGPHCRRUGRPALS

gi|2137499|Mad4m SSDADDHYSLQSSGCSDSSYGHPCRRPGCPGLS

gi|2137498|Mad3m SAGREHSYSHSTCAWL

Mad3h Putative VAGQEHSYSHVGGAWL



[FIGURE (number illegible): two phylogenetic trees of the MAD family; scale bar 0.085; bootstrap values of 100 at the internal nodes. The first tree contains gi|2506888|MADm, gi|729978|MADh, gi|2792362|Mad4h, gi|2137499|Mad4m and gi|2137498|Mad3m. The second tree (panel B) contains the same sequences together with Mad3h Putative.]
COMBINED DECLARATION
AND POWER OF ATTORNEY

(Original, Design, National Stage of PCT, Divisional, Continuation or C-I-P Application) 

As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name; I believe I am the original, 
first and sole inventor (if only one name is listed below) or an original, first and joint inventor (if plural names are 
listed below) of the subject matter which is claimed and for which a patent is sought on the invention entitled: 

GENE DISCOVERY THROUGH COMPARISONS OF NETWORKS OF STRUCTURAL AND 
FUNCTIONAL RELATIONSHIPS AMONG KNOWN GENES AND PROTEINS 

This declaration is of the following type: 

[] original 

[] design 

[] national stage of PCT 

[] divisional 

[] continuation 

[✓] continuation-in-part (C-I-P)

The specification of which: (complete (a), (b), or (c))

(a) [✓] is attached hereto.

(b) [ ] was filed on ___ as Application Serial No. ___ and was amended on ___ (if applicable).
(c) [ ] was described and claimed in

Acknowledgement of Review of Papers and Duty of Candor

I hereby state that I have reviewed and understand the contents of the above identified specification,
including the claims, as amended by any amendment referred to above.

I acknowledge the duty to disclose information which is material to the patentability of the subject matter
claimed in this application in accordance with Title 37, Code of Federal Regulations § 1.56. 

[ ] In compliance with this duty there is attached an information disclosure statement. 37 CFR 1.98. 

Priority Claim 

I hereby claim foreign priority benefits under Title 35, United States Code, § 119(a)-(d) of any foreign 
application(s) for patent or inventor's certificate or of any PCT International Application(s) designating at least one 
country other than the United States of America listed below and have also identified below any foreign 
application(s) for patent or inventor's certificate or any PCT International Application(s) designating at least one 
country other than the United States of America filed by me on the same subject matter having a filing date before 
that of the application on which priority is claimed 

(complete (d) or (e)) 

(d) [ ] no such applications have been filed. 

(e) [ ] such applications have been filed as follows: 
BAKER BOTTS L.L.P.

FILE NO.: A31869-A 70050.1046 



PRIOR FOREIGN/PCT APPLICATION(S) FILED WITHIN 12 MONTHS (6 MONTHS FOR DESIGN) PRIOR TO SAID APPLICATION

COUNTRY    APPLICATION NO.    DATE OF FILING (day, month, year)    DATE OF ISSUE (day, month, year)    PRIORITY CLAIMED UNDER 35 USC 119

                                                                                                       [ ] YES  NO [ ]
                                                                                                       [ ] YES  NO [ ]
                                                                                                       [ ] YES  NO [ ]

ALL FOREIGN APPLICATION(S), IF ANY, FILED MORE THAN 12 MONTHS (6 MONTHS FOR DESIGN) PRIOR TO SAID APPLICATION

                                                                                                       [ ] YES  NO [ ]
                                                                                                       [ ] YES  NO [ ]
                                                                                                       [ ] YES  NO [ ]



Claim for Benefit of Prior U.S. Provisional Application (s) 

I hereby claim the benefit under Title 35, United States Code, § 119(e) of any United States provisional
application(s) listed below:

Provisional Application Number    Filing Date
60/129,469                        April 15, 1999











Claim for Benefit of Earlier U.S./PCT Application(s) under 35 U.S.C. 120 

(complete this part only if this is a divisional, continuation or C-I-P application)

I hereby claim the benefit under Title 35, United States Code, § 120 of any United States application(s) or
PCT international application(s) designating the United States of America that is/are listed below and, insofar as
the subject matter of each of the claims of this application is not disclosed in the prior application(s) in the manner
provided by the first paragraph of Title 35, United States Code § 112, I acknowledge the duty to disclose
information as defined in Title 37, Code of Federal Regulations, § 1.56 which occurred between the filing date of
the prior application(s) and the national or PCT international filing date of this application:

09/327,983                  June 8, 1999     Pending
(Application Serial No.)    (Filing Date)    (Status) (patented, pending, abandoned)

(Application Serial No.)    (Filing Date)    (Status) (patented, pending, abandoned)

Power of Attorney 

As a named inventor, I hereby appoint Dana M. Raymond, Reg. No. 18,540; Frederick C. Carver, Reg. No. 17,021; Francis J. Hone, Reg.
No. 18,662; Joseph D. Garon, Reg. No. 20,420; Arthur S. Tenser, Reg. No. 18,839; Ronald B. Hildreth, Reg. No. 19,498; Thomas R.
Nesbitt, Jr., Reg. No. 22,075; Robert Neuner, Reg. No. 24,316; Richard G. Berkley, Reg. No. 25,465; Richard S. Clark, Reg. No. 26,154;
Bradley B. Geist, Reg. No. 27,551; James J. Maime, Reg. No. 26,946; John D. Murnane, Reg. No. 29,836; Henry Tang, Reg. No. 29,705;
Robert C. Scheinfeld, Reg. No. 31,300; John A. Fogarty, Jr., Reg. No. 22,348; Louis S. Sorell, Reg. No. 32,439; Rochelle K. Seide, Reg.
No. 32,300; Gary M. Butter, Reg. No. 33,841; Marta E. Delsignore, Reg. No. 32,689; and Lisa B. Kole, Reg. No. 35,225 of the firm of
BAKER BOTTS L.L.P., with offices at 30 Rockefeller Plaza, New York, New York 10112, as attorneys to prosecute this application and
to transact all business in the Patent and Trademark Office connected therewith.



SEND CORRESPONDENCE TO:
BAKER BOTTS L.L.P.
30 ROCKEFELLER PLAZA, NEW YORK, N.Y. 10112
CUSTOMER NUMBER: 21003

DIRECT TELEPHONE CALLS TO:
BAKER BOTTS L.L.P.
(212) 705-5000



I hereby declare that all statements made herein of my own knowledge are true and that all statements made
on information and belief are believed to be true; and further that these statements were made with the knowledge
that willful false statements and the like so made are punishable by fine or imprisonment, or both, under Section
1001 of Title 18 of the United States Code and that such willful false statements may jeopardize the validity of the
application or any patent issued thereon.



FULL NAME OF SOLE OR FIRST INVENTOR
LAST NAME: RZHETSKY    FIRST NAME: ANDREY    MIDDLE NAME:

RESIDENCE & CITIZENSHIP
CITY: New York    STATE or FOREIGN COUNTRY: New York    COUNTRY OF CITIZENSHIP: Russia

POST OFFICE ADDRESS
POST OFFICE ADDRESS: 560 Riverside Drive, 11F    CITY: New York    STATE or COUNTRY: New York    ZIP CODE: 10027

DATE:    SIGNATURE OF INVENTOR:




FULL NAME OF SECOND JOINT INVENTOR, IF ANY
LAST NAME: KALACHIKOV    FIRST NAME: SERGEY    MIDDLE NAME:

RESIDENCE & CITIZENSHIP
CITY: New York    STATE or FOREIGN COUNTRY:    COUNTRY OF CITIZENSHIP: Russia

POST OFFICE ADDRESS
POST OFFICE ADDRESS: 154 Haven Avenue, 1303    CITY:    STATE or COUNTRY: New York    ZIP CODE: 10032

DATE:    SIGNATURE OF INVENTOR:


FULL NAME OF SIXTH JOINT INVENTOR, IF ANY
LAST NAME: KRAUTHAMMER    FIRST NAME: MICHAEL    MIDDLE NAME: O.

RESIDENCE & CITIZENSHIP
CITY: New York    STATE or FOREIGN COUNTRY:    COUNTRY OF CITIZENSHIP: Switzerland

POST OFFICE ADDRESS
POST OFFICE ADDRESS: 27 W. 76th Street, Apt. 3A    CITY: New York    STATE or COUNTRY: New York    ZIP CODE: 10023

DATE:    SIGNATURE OF INVENTOR:


FULL NAME OF SECOND JOINT INVENTOR, IF ANY
LAST NAME: FRIEDMAN    FIRST NAME: CAROL    MIDDLE NAME:

RESIDENCE & CITIZENSHIP
CITY: Larchmont    STATE or FOREIGN COUNTRY:    COUNTRY OF CITIZENSHIP: United States

POST OFFICE ADDRESS
POST OFFICE ADDRESS: 14 Dimitri Place    CITY: Larchmont    STATE or COUNTRY: New York    ZIP CODE: 10538

DATE:    SIGNATURE OF INVENTOR:


FULL NAME OF SIXTH JOINT INVENTOR, IF ANY
LAST NAME: KRA    FIRST NAME: PAULINE    MIDDLE NAME:

RESIDENCE & CITIZENSHIP
CITY: Forest Hills    STATE or FOREIGN COUNTRY:    COUNTRY OF CITIZENSHIP: United States

POST OFFICE ADDRESS
POST OFFICE ADDRESS: 109-14 Ascan Ave    CITY: Forest Hills    STATE or COUNTRY: New York    ZIP CODE: 11375

DATE:    SIGNATURE OF INVENTOR:





Check proper box(es) for any added page(s) forming a part of this declaration

[ ] Signature for ninth and subsequent joint inventors. Number of pages added ___.

[ ] Signature by administrator(trix), executor(trix) or legal representative for deceased or incapacitated inventor.
Number of pages added ___.

[ ] Signature for inventor who refuses to sign, or cannot be reached, by person authorized under 37 CFR 1.47.
Number of pages added ___.
% lexsemsub.pl
% lexsemsub.pat
% revised March 17, 2000
% LEXICON OF SUBSTANCES AND STRUCTURES
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Each entry has the form
%   phrase(FirstWord, SemanticClass, WordList, Target, r).
:- multifile(phrase/5).
:- multifile(wdef/3).
:- unknown(_, fail).

phrase('[', protein, ['[', gamma, ']', '-', aminobutyric, acid, a], 'GABAA', r). % ?
phrase('[', smallmolecule, ['[', zeta, ']', subunit], '[zeta] subunit', r). % ?
phrase(116, protein, [116, '-', kd, fyn, '-', associated, protein], '116-kD Fyn-associated protein', r).
phrase(116, protein, [116, '-', kd, protein], '116-kd protein', r).
phrase(3, protein, [3, '-', kinase, '-', akt], '3-kinase-Akt', r).
phrase(ability, affirmation, [ability, to], [], r).
phrase(agc, protein, [agc, protein, kinases], 'AGC', r).
phrase(akt, protein, [akt, mutant], 'Akt mutant', r).
phrase(alternative, substance, [alternative, ntf], 'alternative NTF', r).
phrase(antibody, protein, [antibody, to, phosphotyrosine], 'anti-phosphotyrosine', r).
phrase(antigen, complex, [antigen, receptor], 'antigen receptor', r).
phrase(ap, protein, [ap, '-', 1], 'AP-1', r).
phrase(aspargine, site, [aspargine, '-', 141], 'aspargine-141', r).
phrase(b, cell, [b, cell], 'B cell', r).
phrase(b, cell, [b, cells], 'B cell', r).
phrase(b, species, [b, lymphoblastoid, cells], 'B lymphoblastoid cells', r).
phrase(b, cell, [b, lymphoblastoid, cells], 'B lymphoblastoid cells', r).
phrase(b7, protein, [b7, '-', '1'], 'B7-1', r).
phrase(bcl, protein, [bcl, '-', 2], 'Bcl-2', r).
phrase(c, protein, [c, '-', jun], 'c-Jun', r).
phrase(camk, protein, [camk, iv], 'CaMK IV', r).
phrase(casp, protein, [casp, '-', 3], 'caspase-3', r).
phrase(caspase, protein, [caspase, '-', 3, family, protease], 'caspase-3 family protease', r).
phrase(caspase, protein, [caspase, '-', 3, precursor], 'caspase-3 precursor', r).
phrase(caspase, protein, [caspase, '-', 3], 'caspase-3', r).
phrase(caspase, protein, [caspase, -, 3], 'caspase-3', r).
phrase(caspase, protein, [caspase, '-', 6], 'caspase-6', r).
phrase(caspase, protein, [caspase, '-', 7], 'caspase-7', r).
phrase(catalytic, domain, [catalytic, domain], 'catalytic domain', r).
phrase(cleavage, site, [cleavage, site], 'cleavage site', r).
phrase(cleavage, substance, [cleavage, products], 'cleavage products', r).
phrase(cooh, substance, [cooh, '-', terminal, fragment], 'COOH-terminal fragment', r).
phrase(crk, protein, [crk, proteins], 'crk proteins', r).
phrase(crkl, complex, [crkl, '-', c3g, complex], 'crkl-c3g complex', r).
phrase(dcp, protein, [dcp, -, 1], 'DCP-1', r).
phrase(did, negation, [did, not], not, r).
% The scan shows only four arguments for the next entry; the word list [ebv] is assumed.
phrase(ebv, species, [ebv], 'Epstein-Barr virus', r).
phrase(epstein, species, [epstein, '-', barr, virus], 'Epstein-Barr virus', r).
phrase(familial, disease, [familial, alzheimer, '''', s, disease], 'familial Alzheimer''s disease', r).
phrase(gene, gene, [gene, encoding, interleukin, '-', 2], 'gene encoding interleukin-2', r).
phrase(gst, protein, [gst, '-', 'fyn', '-', sh2], 'GST-Fyn-SH2', r).
phrase(gst, protein, [gst, '-', 'fyn', '-', sh3], 'GST-Fyn-SH3', r).
phrase(gtp, complex, [gtp, exchange, of, rap1], 'GTP exchange of Rap1', r).
phrase(guanidine, protein, [guanidine, nucleotide, '-', releasing, factor, c3g], 'guanidine nucleotide-releasing factor C3G', r).
phrase(guanidine, smallmolecule, [guanidine, nucleotide], 'guanidine nucleotide', r).
phrase(guanosine, smallmolecule, [guanosine, triphosphate], 'guanosine triphosphate', r).
phrase(guanosine, smallmolecule, [guanosine, diphosphate], 'guanosine diphosphate', r).
phrase(h4, cell, [h4, cell, line], 'H4 cell line', r).
phrase(h4, cell, [h4, human, neuroglioma, cells], 'H4, human, neuroglioma, cells', r).
phrase(ha, protein, [ha, '-', '[', delta, ']', phpkb], 'HA-[Delta]PHPKB', r).
phrase(hla, protein, [hla, '-', dr7], 'HLA-DR7', r).
phrase(i, protein, [i, '[', kappa, ']', b, '-', '[', beta, ']'], 'I[kappa]B-[beta]', r).
phrase(i, protein, [i, '[', kappa, ']', b, '[', alpha, ']'], 'I[kappa]B-[alpha]', r).
phrase(i, protein, [i, '[', kappa, ']', b], 'I[kappa]B', r).
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interleukin, protein, [interleukin, ! - ' , 1 , beta, converting, enzy 
interleukin-1 beta converting enzyme' ,r). 
phrase(jurkat, cell, [jurkat, cell], 'Jurkat cell', r).
phrase(jurkat, cell, [jurkat, cells], 'Jurkat cell', r).
phrase(kif3a, protein, [kif3a, '/', 3, b], 'KIF3A/3B', r).
phrase(lbl, cell, [lbl, '-', drf, cells], 'LBL-DR7 cells', r).
phrase(lbl, cell, [lbl, '-', dr7, cells], 'LBL-DR7 cells', r).
phrase(let, protein, [let, '-', 23], 'Let-23', r).
phrase(may, probability, [may, be], possible, r).
phrase(ice, protein, [ice, '/', ced, '-', 3], 'ICE/Ced-3', r).
phrase(il, gene, [il, '-', 2, gene], 'gene encoding interleukin-2', r).
phrase(il, protein, [il, '-', 2], 'interleukin-2', r).
phrase(in, interm, [in, the, case, of], [], r).
phrase(in, state, [in, the, anergic, state], inactive, r).
phrase(inducible, cell, [inducible, h4, cell], 'inducible H4 cell', r).
phrase(interleukin, protein, [interleukin, '-', 2], r).
phrase(interleukin, protein, [interleukin, '-', 3], 'interleukin-3', r).
phrase(myc, protein, [myc, '-', p70s6kd3e], 'Myc-p70s6kD3E', r).
phrase(myc, protein, [myc, '-', pdk1], 'Myc-PDK1', r).
phrase(myc, protein, [myc, '-', p70s6k], 'Myc-p70s6k', r).
phrase(myc, protein, [myc, '-', p70s6ke389d3e], 'Myc-p70s6kE389D3E', r).
phrase(myr, protein, [myr, '-', akt], 'Myr-Akt', r).
phrase(n, protein, [n, '-', methyl, '-', d, '-', aspartate, receptor], 'N
r).
phrase(n, protein, [n, '-', methyl, '-', d, '-', aspartate], 'NMDA').
phrase(native, cell, [native, h4, cell], 'native H4 cell', r).
phrase(nf, protein, [nf, '-', '[', kappa, ']', b], 'NF-[kappa]B', r).
phrase(nh2, site, [nh2, '-', terminal], 'NH2-terminal', r).
phrase(nh2, substance, [nh2, '-', terminal, fragment], 'NH2-terminal fragment', r).
phrase(nih, cell, [nih, '-', 3, t3, fibroblasts], 'NIH-3T3 fibroblasts', r).
phrase(nih, cell, [nih, '-', '3t3', fibroblasts], 'NIH-3T3 fibroblasts', r).
phrase(normal, substance, [normal, ntf], 'normal NTF', r).
phrase(nuclear, protein, [nuclear, factor, kappa, b], 'NF-[kappa]B', r).
phrase(p150Glued, protein, [p150Glued, '-', arp1], 'p150Glued-Arp1', r).
phrase(phosphate, phosphorylate2, [phosphate, incorporated, into], phosphorylate, r).

phrase (phosphatidylinositol , smallmolecule, [phosphatidylinositol , 1 
/ ' / '/S, triphosphate] , 'phosphatidylinositol 1,4,5-tripho 

sphate ' , r) . 

phrase(phosphoinositide, protein, [phosphoinositide, '-', dependent, protein, kinase], 'PDK1', r).
phrase(phospholipase, protein, [phospholipase, c, '-', 1], 'phospholipase C-1', r).
phrase(poly, protein, [poly, '(', adp, '-', ribose, ')', polymerase], 'poly (ADP-ribose) polymerase', r).
phrase(polyvinylidene, structure, [polyvinylidene, difluoride, membranes], 'polyvinylidene difluoride membranes', r).
phrase(presenilin, protein, [presenilin, 1], 'presenilin 1', r).
phrase(presenilin, protein, [presenilin, 2], 'presenilin 2', r).
phrase(productively, state, [productively, stimulated], active, r).
phrase(protein, protein, [protein, tyrosine, kinase], 'protein tyrosine kinase', r).
phrase(protein, protein, [protein, kinase, c], 'protein kinase C', r).
phrase(ps2, substance, [ps2, '-', ctf], 'presenilin 2 COOH-terminal fragment', r).
phrase(ps2, substance, [ps2, cleavage, fragment], 'presenilin 2 cleavage fragment', r).
phrase(pvdf, structure, [pvdf, membranes], 'polyvinylidene difluoride membranes', r).
phrase(raf, protein, [raf, '-', 1], 'Raf-1', r).
phrase(raf, protein, [raf, '-', 1], 'Raf-1', r).
phrase(rap1, complex, [rap1, '-', gtp], 'Rap1-GTP', r).
phrase(requirement, need2, [requirement, for], need, r).
phrase(ser, smallmolecule, [ser, 19], 'Ser 19', r).
phrase(ser, smallmolecule, [ser, 23], 'Ser 23', r).
phrase(serine, substance, [serine, residues], 'serine residues', r).
phrase(src, domain, [src, homology, 2], 'Src homology 2', r).
phrase(src, domain, [src, homology, 3], 'Src homology 3', r).
phrase(srebp, protein, [srebp, '-', 1], 'sterol-regulatory element binding protein 1', r).
phrase(srebp, protein, [srebp, '-', 2], 'sterol-regulatory element binding protein 2', r).
phrase(sterol, protein, [sterol, '-', regulatory, element, binding, protein, 1], 'sterol-regulatory element binding protein 1', r).
phrase(sterol, protein, [sterol, '-', regulatory, element, binding, protein, 2], 'sterol-regulatory element binding protein 2', r).
phrase(t, cell, [t, '-', dr7], 't-DR7', r).
phrase(t, cell, [t, '-', drt, '/', b7, '-', 1], 't-DR7/B7-1', r).
phrase(t, cell, [t, cell], 'T cell', r).
phrase(t, cell, [t, cells], 'T cell', r).
phrase(t, complex, [t, '-', cell, receptor], 'T-cell receptor', r).
phrase(t, cell, [t, '-', dr7, cells], 't-DR7 cells', r).
phrase(t, cell, [t, '-', dr7, '/', b7, '-', 1], 't-DR7/B7-1', r).
phrase(t, complex, [t, '-', cell, antigen, receptor], 'T-cell antigen receptor', r).
phrase(threonine, aminoacid, [threonine, 229], 'threonine 229', r).
phrase(transcription, protein, [transcription, factor], 'transcription factor', r).
phrase(trypan, smallmolecule, 'trypan blue', r).
phrase(wt, protein, [wt, akt], 'WTAkt', r).
phrase(zap, protein, [zap, '-', 70], 'ZAP-70', r).
phrase(zdevd, smallmolecule, [zdevd, '-', fmk], 'zDEVD-fmk', r).
phrase(il, protein, [il, '-', 3], 'interleukin-3', r).
wdef(ab, complex, antibody).
wdef(actin, protein, actin).
wdef(activated, state, active).
wdef(active, state, active).
wdef(ad, disease, 'Alzheimer''''s disease').
wdef(agc, protein, 'AGC').
wdef(akt, protein, 'AKT').
wdef(anergic, state, inactive).
wdef(anergic, state, inactive).
wdef(anergy, state, inactive).
wdef(antibody, complex, antibody).
wdef(antigen, substance, antigen).
wdef(aop, protein, 'Aop').
wdef(apoptosis, process, apoptosis).
wdef(bad, protein, 'BAD').
wdef(c3g, protein, 'C3G').
wdef('ca2+', smallmolecule, 'Ca2+').
wdef(cas, protein, 'Cas').
wdef(caspase, protein, caspase).
wdef(caspase, protein, caspase).
wdef(cbl, protein, 'Cbl').
wdef(ccrsrh, protein, 'CCRSrh').
wdef(cd28, protein, 'CD28').
wdef(cells, structure, cell).
wdef(cholesterol, smallmolecule, cholesterol).
wdef(cpp32, protein, 'CPP32').
wdef(crkl, protein, 'CrkL').
wdef(ctf, substance, 'COOH-terminal fragment').
wdef(cytokine, smallmolecule, cytokine).
wdef(cytosol, structure, cytosol).
wdef(djnk, protein, 'DJNK').
wdef(djun, protein, 'DJun').
wdef(dynamitin, protein, dynamitin).
wdef(erk, protein, 'ERK').
wdef(eto, smallmolecule, 'ETO').
wdef(etoposide, smallmolecule, etoposide).
wdef(fad, disease, 'familial Alzheimer''''s disease').
wdef(fyn, protein, 'Fyn').
wdef(gdp, smallmolecule, 'GDP').
wdef(gelsolin, protein, gelsolin).
wdef(gp120, protein, 'gp120').
wdef(grb2, protein, 'Grb2').
wdef(gst, protein, 'glutathione S-transferase').
wdef(gtp, smallmolecule, 'GTP').
wdef(hsp70, protein, 'HSP70').
wdef(human, species, human).
wdef(ikk, protein, 'IKK').
wdef(inactivated, state, inactive).
wdef(inactive, state, inactive).
wdef(jnk, protein, 'JNK').
wdef(jnk, protein, 'JNK').
wdef(jnk2, protein, 'JNK2').
wdef(kap3, protein, kap3).
wdef(kdakt, protein, 'KDAkt').
wdef(kinase, protein, kinase).
wdef(kinectin, protein, kinectin).
wdef(klc, protein, klc).
wdef(lamin, protein, lamin).
wdef(myosins, protein, myosins).
wdef(nmdar, protein, 'NMDAR').
wdef(nmdar2b, protein, 'NMDAR2B').
wdef(ntf, substance, 'NH2-terminal fragment').
wdef(p70s6k, protein, p70s6k).
wdef(p78s6k, protein, p78s6k).
wdef(parp, protein, 'poly (ADP-ribose) polymerase').
wdef(pdk1, protein, 'PDK1').
wdef(peptides, protein, peptide).
wdef(pkb, protein, 'PKB').
wdef(pkc, protein, 'protein kinase C').
wdef(position, site, site).
wdef(positions, site, site).
wdef(protease, protein, protease).
wdef(ps1, protein, 'presenilin 1').
wdef(ps2, protein, 'presenilin 2').
wdef(rap1, protein, 'Rap1').
wdef(ras, protein, 'Ras').
wdef(receptors, substance, receptor).
wdef(rela, protein, 'RelA').
wdef(residues, substance, residue).
wdef(responsive, state, active).
wdef(s6, protein, 'S6').
wdef(selectively, constraint, selective).
wdef(ser112, site, 'Ser112').
wdef(ser136, site, 'Ser136').
wdef(ser32, smallmolecule, 'Ser32').
phrase (psl , protein 

wdef (ser36, smallmolecule, 'Ser36') . 

phrase (psl, protein, [psl , 1 - 1 , ctf ] , 'psl-ctf, 

wdef(sh2, domain, 'SH2').
wdef(sh3, domain, 'SH3').
wdef(shc, protein, 'Shc').
wdef(signalsome, complex, signalsome).
wdef(sites, site, site).
wdef(sos, protein, 'Sos').
wdef(staurosporine, smallmolecule, staurosporine).
wdef(sts, smallmolecule, 'STS').
wdef(tcr, complex, 'T-cell receptor').
wdef(tetracycline, smallmolecule, tetracycline).
wdef(thr229, aminoacid, 'Thr229').
wdef(thr308, aminoacid, 'Thr308').
wdef(thr389, aminoacid, 'Thr389').
wdef(threonine, aminoacid, threonine).
wdef(tyrosine, aminoacid, tyrosine).
wdef(unresponsive, state, inactive).
wdef(unstimulated, state, inactive).
wdef(zvad, smallmolecule, 'zVAD').

% lexsyn.pat

% revised March 17, 2000
% SYNTACTIC LEXICON FOR ACTIONS
% Contains syntactic entries for action type words and phrases
%
% synp(+Word1, +Wordlist, +Syn)
% synp: Word1 is first word of phrase, Wordlist is list of words in phrase
% synp: Syn is syntactic category
%
% synw(+Word, +Syn) is same as synp except there is no wordlist
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
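% Illustrative sketch only, not part of the original listing. It shows one
% way the synp/2-word and synw/1-word entries described above could be
% consulted when scanning a token list. The predicate match_action/4 is a
% hypothetical helper introduced here; the four facts are copied from this
% lexicon so the sketch runs on its own. It assumes a Prolog system that
% provides append/3 (for example, SWI-Prolog).

```prolog
% Sample entries copied from the lexicon (assumption: loaded standalone).
synp(account, [account, for], v).
synp(down, [down, '-', regulate], v).
synw(activate, v).
synw(bind, v).

% match_action(+Tokens, -Syn, -Matched, -Rest)  (hypothetical helper)
% First try a multi-word synp entry, indexed by the first token, whose
% Wordlist is a prefix of Tokens; otherwise fall back to a synw entry.
match_action([W|Ws], Syn, Wordlist, Rest) :-
    synp(W, Wordlist, Syn),
    append(Wordlist, Rest, [W|Ws]).
match_action([W|Rest], Syn, [W], Rest) :-
    synw(W, Syn).
```

% Example query: match_action([down, '-', regulate, b], Syn, M, Rest).
% The first argument of synp acts as an index, so only phrases beginning
% with the current token are compared against the input.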

synp (account , [account , for] ,v) . 

synp (account , [account , for] ,vp) . 

synp (accounted, [accounted, for] ,ved) . 

synp (accounted, [accounted, for] ,ven) . 

synp (accounting, [accounting, for] ,ving) . 

synp (accounting, [accounting, for] ,n) . 

synp (accounts, [accounts , for] ,vp) . 

synp (add, [add, up] ,vp) . 

synp (add, [add, up] ,v) . 

synp (added, [added, up] ,ved) . 

synp (added, [added, up] ,ven) . 

synp (adding, [adding, up] ,n) . 

synp (adding, [adding, up] ,ving) . 

synp(adds, [adds, up] ,vp) . 

synp (am, [am, a, means , of , producing] ,vp) . 

synp (am, [am, due, to] ,vp) . 

synp (are, [are , a, means , of , producing] ,vp) . 

synp (are, [are, due, to] ,vp) . 

synp (as, [as , a, result , of ] ,prep) . 

synp (attributable, [attributable , to] , vp) . % ? 

synp (attributed, [attributed, to] ,ven) . 

synp (based, [based, on] ,ven) . 

synp(based, [based, upon], ven).

synp (be, [be , a, means , of , producing] ,v) . 

synp (be , [be , due , to] , v) . 

synp (because, [because, of] ,prep) . 

synp (been, [been, a, means , of , producing] ,ven) . 

synp (been, [been, due, to] ,ven) . 

synp (being, [being, a, means, of , producing] ,n) . 

synp (being, [being, a, means , of , producing] ,ving) .
synp (being, [being, due, to] ,n) .
synp (being, [being, due, to] ,ving) . 
synp (caused, [caused, by] ,ved) . 
synp (caused, [caused, by] ,ven) . 
synp (convey, [convey, a, signal] ,v) . 
synp (convey, [convey, a, signal] ,vp) . 
synp (conveyed, [conveyed, a, signal] ,ved) . 
synp (conveyed, [conveyed, a, signal] ,ven) . 
synp (conveying, [conveying, a, signal] ,ving) . 
synp (conveying, [conveying, a, signal] ,n) . 
synp (conveys, [conveys, a, signal] , vp) . 
synp (dissociate, [dissociate, from] ,vp) . 
synp (dissociate, [dissociate, from] ,v) . 
synp (dissociated, [dissociated, from] ,ved) . 
synp (dissociated, [dissociated, from] ,ven) . 
synp (dissociates, [dissociates, from] ,vp) . 
synp (dissociating, [dissociating, from] ,n) . 
synp (dissociating, [dissociating, from] ,ving) . 
synp (dissociation, [dissociation, from] ,n) . 
synp(down, [down, '-', regulate], v).

synp(down, [down, '-', regulate], vp). % A down-regulates B: A --> B

synp(down, [down, '-', regulated], ved).

synp(down, [down, '-', regulated], ven).

synp(down, [down, '-', regulates], vp).

synp(down, [down, '-', regulating], n).

synp(down, [down, '-', regulating], ving).

synp(down, [down, '-', regulation], n).

synp(due, [due, to, the, fact, that], adj).

synp(due, [due, to], adj). % ?

synp (form, [form, complex] ,v) . 

synp (form, [form, complex] ,vp) . 

synp (formation, [formation, of, complex] ,n) . 

synp (formed, [formed, complex] ,ved) . 

synp (formed, [formed, complex] ,ven) . 

synp (forming, [forming, complex] ,n) . 

synp (forming, [forming, complex] ,ving) . 

synp (forms, [forms, complex] ,vp) . 

synp (had, [had, an, active, role, in] , ved) . 

synp(had, [had, an, active, role, in], ven).

synp(has, [has, an, active, role, in], vp).

synp(have, [have, an, active, role, in], v).

synp(have, [have, an, active, role, in], vp).
synp(having, [having, an, active, role, in], n).
synp (having, [having, an, active , role , in] ,ving) . 
synp(is, [is, a, means, of , producing] ,vp) . 
synp (is, [is, due, to] ,vp) . 

synp (functions , [functions , as , a , negative , regulator , of ] ,vp) . 

synp (function, [function, as, a, negative, regulator, of ] ,vp) . 

synp (lead, [lead, to] ,v) . 

synp (leads, [leads, to] ,vp) . 

synp (leading, [leading, to] ,n) . 

synp (leading, [leading, to] ,ving ) . 

synp(leads, [leads, to], vp).

synp (led, [led, to] ,ved) . 

synp (led, [led, to] ,ven) . 

synp (may, [may , be , responsible , for] ,vp) . 

synp(mediate, [mediate, a, signal], v). % A mediates a signal to B

synp (mediate, [mediate, a, signal] , vp) . 
synp (mediated, [mediated, a, signal], ved) . 
synp (mediated, [mediated, a, signal] , ven) . 
synp (mediates, [mediates, a, signal] , vp) . 
synp (mediating, [mediating, a, signal] , n) . 
synp (mediating, [mediating, a, signal], ving) . 
synp (mediation, [mediation, of , a, signal], n). 



synp(n, [n, '-', acetylate], v).
synp(n, [n, '-', acetylate], vp).
synp(n, [n, '-', acetylated], ved).
synp(n, [n, '-', acetylated], ven).
synp(n, [n, '-', acetylates], vp).
synp(n, [n, '-', acetylating], n).
synp(n, [n, '-', acetylating], ving).
synp(n, [n, '-', acetylation], n).
synp(n, [n, '-', acylate], v).
synp(n, [n, '-', acylate], vp).
synp(n, [n, '-', acylated], ved).
synp(n, [n, '-', acylated], ven).
synp(n, [n, '-', acylates], vp).
synp(n, [n, '-', acylating], n).
synp(n, [n, '-', acylating], ving).
synp(n, [n, '-', acylation], n).
synp(n, [n, '-', glycosylate], v).
synp(n, [n, '-', glycosylate], vp).
synp(n, [n, '-', glycosylated], ved).
synp(n, [n, '-', glycosylated], ven).
synp(n, [n, '-', glycosylates], vp).
synp(n, [n, '-', glycosylating], n).
synp(n, [n, '-', glycosylating], ving).
synp(n, [n, '-', glycosylation], n).
synp(n, [n, '-', terminal, proteolysis], n).
synp(o, [o, '-', glycosylate], v).
synp(o, [o, '-', glycosylate], vp).
synp(o, [o, '-', glycosylated], ved).
synp(o, [o, '-', glycosylated], ven).
synp(o, [o, '-', glycosylates], vp).
synp(o, [o, '-', glycosylating], n).
synp(o, [o, '-', glycosylating], ving).
synp(o, [o, '-', glycosylation], n).
synp(only, [only, after], prep).

synp(prolyl, [prolyl, '-', 4, '-', hydroxylate], v).
synp(prolyl, [prolyl, '-', 4, '-', hydroxylate], vp).
synp(prolyl, [prolyl, '-', 4, '-', hydroxylated], ved).
synp(prolyl, [prolyl, '-', 4, '-', hydroxylated], ven).
synp(prolyl, [prolyl, '-', 4, '-', hydroxylates], vp).
synp(prolyl, [prolyl, '-', 4, '-', hydroxylating], n).
synp(prolyl, [prolyl, '-', 4, '-', hydroxylating], ving).
synp(prolyl, [prolyl, '-', 4, '-', hydroxylation], n).
synp(result, [result, from], v).
synp(result, [result, from], vp).
synp(result, [result, in], v).
synp(result, [result, in], vp).
synp(resulted, [resulted, from], ved).
synp(resulted, [resulted, from], ven).
synp(resulted, [resulted, in], ved).
synp(resulted, [resulted, in], ven).
synp(resulting, [resulting, from], n).
synp(resulting, [resulting, from], ving).
synp(resulting, [resulting, in], n).
synp(resulting, [resulting, in], ving).
synp(results, [results, from], vp).
synp(results, [results, in], vp).



synp(set, [set, free], v).
synp(set, [set, free], v).
synp(set, [set, free], ved).
synp(set, [set, free], ved).
synp(set, [set, free], ven).
synp(set, [set, free], ven).
synp(set, [set, free], vp).
synp(set, [set, free], vp).

synp(sets, [sets, free], vp).

synp(sets, [sets, free], vp).

synp (setting, [setting, free] ,n) . 

synp (setting, [setting, free],n). 

synp (setting, [setting, free] ,ving) . 

synp (setting, [setting, free] ,ving) . 

synp (suppress, [suppress, activity, of] ,v) . 

synp (suppress, [suppress, activity, of] ,vp) . 

synp (suppressed, [suppressed, activity, of ] ,ved) . 

synp (suppressed, [suppressed, activity, of] ,ven) . 

synp (suppresses , [suppresses, activity, of ] ,vp) . 

synp (suppressing, [suppressing, activity, of],n). 

synp (suppressing, [suppressing, activity, of],ving). 

synp (suppression, [suppression, of , activity, of],n) . 

synp (switch, [switch, on, the, activity, of] ,vp) . 

synp (switched, [switched, on, the, activity, of] ,ved) . 

synp (switched, [switched, on, the, activity, of] ,ved) . 

synp (switched, [switched, on, the, activity, of] ,ved) . 

synp (switched, [switched, on, the, activity, of] ,ved) . 

synp (switched, [switched, on, the, activity, of] ,ved) . 

synp (switches, [switches, on, the, activity, of] ,vp) . 

synp(up, [up, '-', regulate], v). % A up-regulates B: B --> A

synp(up, [up, '-', regulate], vp). % A up-regulates B: B --> A

synp(up, [up, '-', regulated], ved).

synp(up, [up, '-', regulated], ven). % A up-regulates B: B --> A
synp(up, [up, '-', regulates], vp).

synp(up, [up, '-', regulating], n). % A up-regulates B: B --> A

synp(up, [up, '-', regulating], ving). % A up-regulates B: B --> A

synp(up, [up, '-', regulation], n).

synp(was, [was , a, means , of , producing] , ved) . 

synp (was , [was , due , to] , ved) . 

synp (were, [were , a, means , of , producing] , ved) . % ? 
synp (were , [were , due , to] , ved) . 
synw (acetylate, v) . 

synw(acetylate, vp).

synw (acetylated, ved) . 

synw (acetylated, ven) . 

synw (acetylates, vp) . 

synw (acetylating, n) . 

synw (acetylating, ving) . 

synw (acetylation, n) . 

synw(activate, v).
synw(activate, vp).
synw(activated, ved).
synw(activated, ven).
synw(activates, vp).
synw(activating, n).
synw(activating, ving).
synw(activation, n).
synw(add, v).
synw(add, vp).
synw(added, ved).
synw(added, ven).
synw(adding, n).
synw(adding, ving).
synw(addition, n).
synw(adds, vp).
synw(after, prep).
synw(aggregate, v).
synw(aggregate, vp).
synw(aggregated, ved).
synw(aggregated, ven).
synw(aggregates, vp).
synw(aggregating, n).
synw(aggregating, ving).
synw(aggregation, n).
synw(arrest, n).
synw(arrest, v).
synw(arrest, vp).
synw(arrested, ved).
synw(arrested, ven).
synw(arresting, n).
synw(arresting, ving).
synw(arrests, vp).
synw(associate, v).
synw(associate, vp).
synw(associated, ved).
synw(associated, ven).
synw(associates, vp).
synw(associating, n).
synw(associating, ving).
synw(association, n).
synw(attach, v).
synw(attach, vp).
synw(attached, ved).
synw(attached, ven).
synw(attaches, vp).
synw (attaching , n) . 
synw (attaching ,ving) . 
synw (attachment , n) . 
synw (bind, v) . 
synw (bind, vp) . 
synw (binding, n) . 
synw (binding, ving) . 
synw (binds, vp) . 
synw (block, v) . 
synw (block, vp) . 
synw (blockage, n) . 
synw (blocked, ved) . 
synw (blocked, ven) . 
synw (blocking, n) . 
synw (blocking, ving) . 
synw ( blocks, vp) . 
synw (bound, ved) . 
synw (bound, ven) . 
synw (break, v) . 
synw (break, vp) . 
synw (breakage, n) . 
synw (breaking, n) . 
synw (breaking, ving) . 
synw (breaks , vp) . 
synw (broke, ved) . 
synw (broken, ven) . 
synw (catalyzation, n) . 
synw (catalyze, v) . 
synw (catalyze, vp) . 
synw (catalyzed, ved) . 
synw (catalyzed, ven) . 
synw (catalyzes, vp) . 
synw (catalyzing, n) . 
synw (catalyzing, ving) . 
synw (causation, n) . 
synw (cause, n) . 
synw (cause, v) . 
synw (cause, ven) . 
synw (cause, vp) . 
synw (caused, ved) . 
synw (causes, vp) .
synw (causing, n) .

synw (causing, ving) . 

synw (cleavage, n) . 

synw (cleave, v) . 

synw (cleave, vp) . 

synw (cleaved, ved) . 

synw (cleaved, ven) . 

synw (cleaves , vp) . 

synw (cleaving, n) . 

synw (cleaving, ving) . 

synw (coimmunoprecipitate ,v) . 

synw (coimmunoprecipitate, vp) . 

synw (coimmunoprecipitated ,ved) . 

synw (coimmunoprecipitated ,ven) . 

synw (coimmunoprecipitates, vp) . 

synw (coimmunoprecipitating ,n) . 

synw (coimmunoprecipitating ,ving) . 

synw (coimmunoprecipitation ,n) . 

synw (combination , n) . 

synw (combine ,v) . 

synw (combine , vp) . 

synw (combined ,ved) . 

synw (combined ,ven) . 

synw (combines, vp) . 

synw (combining , n) . 

synw (combining ,ving) . 

synw (conjugate ,v) . 

synw (conjugate ,vp) . 

synw(conjugated, ven).

synw (conjugated ,ved) . 

synw (conjugates, vp) . 

synw (conjugating ,n) . 

synw (conjugating ,ving) . 

synw (conjugation , n) . 

synw (connect , vp). 

synw (connect ,v) . 

synw(connected, ven).

synw (connected ,ved) . 

synw (connecting ,n) . 

synw (connecting ,ving) . 

synw (connection ,n) . 

synw (connects , vp) . 

synw (constrain, v) .
synw (constrain, vp) .
synw (constrained, ved) . 
synw (constrained, ven) . 
synw (constraining, n) . 
synw (constraining, ving) . 
synw (constrains , vp) . 
synw (constraint , n) . 
synw (coprecipitate, v) . 
synw (coprecipitate, vp) . 
synw (coprecipitated, ved) . 
synw (coprecipitated, ven) . 
synw(coprecipitates, vp) . 
synw (coprecipitating,n) . 
synw (coprecipitating, ving) . 
synw (coprecipitation ,n) . 
synw(copurification, n).
synw(copurified, ved).
synw(copurified, ven).
synw(copurifies, vp).
synw(copurify, vp).
synw(copurify, v).
synw(copurifying, n).
synw(copurifying, ving).
synw (couple , vp) . 
synw (couple, v) . 
synw (coupled, ved) . 
synw (coupled, ven) . 
synw (couples, vp) . 
synw (coupling, n) . 
synw (coupling, ving) . 
synw (cut , n) . 
synw (cut , v) . 
synw (cut , ved) . 
synw (cut , ven) . 
synw (cut , vp) . 
synw (cuts, vp) . 
synw (cutting, n) . 
synw (cutting, ving) . 
synw (deactivate, v) . 
synw (deactivate, vp) . 
synw (deactivated, ved) . 
synw (deactivated, ven) . 
synw (deactivates, vp) .
synw (deactivating, n) .
synw (deactivating, ving) . 
synw (deactivation, n) . 
synw (death, n) . 
synw (demethylate, v) . 
synw (demethylate, vp) . 
synw(demethylated,ved) . 
synw(demethylated, ven) . 
synw (demethylates, vp) . 
synw (demethylating, n) . 
synw (demethylating, ving) . 
synw (demethylation, n) . 
synw(dephosphorylate, v) . 
synw (dephosphorylate , vp) . 
synw (dephosphorylated, ved) . 
synw (dephosphorylated, ven) . 

synw (dephosphorylates, vp) . 

synw (dephosphorylating, n) . 

synw(dephosphorylating, ving) . 

synw(dephosphorylation, n).

synw (die, v) . 

synw (die , vp) . 

synw (died, ved) . 

synw (died, ven) . 

synw (dies, vp) . 

synw (disassemble, v) . 

synw (disassemble, vp) . 

synw (disassembled, ved) . 

synw (disassembled, ven) . 

synw (disassembles, vp) . 

synw (disassembling, n) . 

synw (disassembling, ving) . 

synw (disassembly, n) . 

synw (discharge, n) . 

synw (discharge, v) . 

synw (discharge, vp) . 

synw (discharged, ved) . 

synw (discharged, ven) . 

synw (discharges, vp) . 

synw (discharging, n) . 

synw (discharging, ving) . 

synw (disengage, v) . 

synw(disengage, vp).
synw(disengaged, ved).
synw(disengaged, ven).
synw(disengagement, n).
synw(disengages, vp).
synw (disengaging, n) . 
synw (disengaging, ving) . 
synw (divide, v) . 
synw (divide, vp) . 
synw (divided, ved) . 
synw (divided, ven) . 
synw (divides , vp) . 
synw (dividing, n) . 
synw (dividing, ving) . 
synw (division, n) . 
synw (dying, n) . 
synw (dying, ving) . 
synw (enhance, v) . 
synw (enhance, vp) . 
synw (enhanced, ved) . 
synw (enhanced, ven) . 
synw (enhancement , n) . 
synw (enhances, vp) . 
synw(enhancing, n).
synw (enhancing, ving) . 
synw (express, v) . 
synw (express, vp) . 
synw (expressed, ved) . 
synw (expressed, ved) . 
synw (expressed, ven) . 
synw (expresses, vp) . 
synw (expressing, n) . 
synw (expressing, n) . 
synw (expressing, ving) . 
synw (expression, n) . 
synw (generate, v) . 
synw (generate, vp) . 
synw (generated, ved) . 
synw (generated, ven) . 
synw (generates, vp) . 
synw (generating, n) . 
synw (generating, ving) . 
synw (generation, n) . 
synw (hew, v) .
synw(hew,vp) .
synw (hewed, ved) . 
synw (hewed, ven) . 
synw (hewing, n) . 
synw (hewing , ving) . 
synw (hews, vp) . 
synw (hinder , v) . 
synw (hinder , vp) . 
synw (hindered, ved) . 
synw (hindered ,ven) . 
synw (hindering, n) . 
synw (hindering, ving) . 
synw (hinders , vp) . 
synw (hindrance, n) . 
synw (inactivate, v) . 
synw (inactivate, vp) . 
synw (inactivated, ved) . 
synw (inactivated, ven) . 
synw (inactivates , vp) . 
synw (inactivating, n) . 
synw (inactivating, ving) . 

synw(inactivation, n).

synw (incite, v) . 

synw (incite, vp) . 

synw (incited, ved) . 

synw (incited, ven) . 

synw (incitement , n) . 

synw (incites, vp) . 

synw (inciting, n) . 

synw (inciting, ving) . 

synw ( induce , v) . 

synw (induce , vp) . 

synw (induced, ved) . 

synw (induced, ven) . 

synw(induces, vp).

synw (inducing, n) . 

synw (inducing, ving) . 

synw (induction, n) . 

synw (influence, n) . 

synw (influence, v) . 

synw (influence, vp) . 

synw (influenced, ved) . 

synw (influenced, ven) .
synw (influences, vp) .

synw (influencing, n) . 

synw (influencing, ving) . % ? 

synw (inhibit ,v) . 

synw (inhibit ,vp) . 

synw (inhibited, ved) . 

synw (inhibited, ven) . 

synw (inhibiting, n) . 

synw (inhibiting, ving) . 

synw (inhibition, n) . 

synw (inhibits, vp) . 

synw (initiate, v) . 

synw (initiate, vp) . 

synw (initiated, ved) . 

synw (initiated, ven) . 

synw (initiates ,vp) . 

synw (initiating, n) . 

synw (initiating, ving) . 

synw(initiation, n).

synw (instigate, v) . 

synw (instigate, vp) . 

synw (instigated, ved) . 

synw (instigated, ven) . 

synw ( instigates , vp) . 

synw (instigating, n) . 

synw (instigating, ving) . 

synw (instigation, n) . 

synw ( interact , v) . 

synw (interact , vp) . 

synw (interacted, ved) . 

synw (interacted, ven) . 

synw (interacting, n) . 

synw (interacting, ving) . 

synw (interaction, n) . 

synw (interactions, n) . 

synw (interacts , vp) . 

synw(join, vp).

synw(join, v).

synw(joined, ved).

synw(joined, ven).

synw (joining, n) . 

synw (joining, ving) . 

synw(joins, vp).
synw (juncture, n) .
synw (liberate, v) . 
synw (liberate, vp) . 
synw (liberated, ved) . 
synw (liberated, ven) . 
synw (liberates, vp) . 
synw (liberating, n) . 
synw (liberating, ving) . 
synw (liberation, n) . 
synw (limit , v) . 
synw (limit , vp) . 
synw (limitation, n) . 
synw (limited, ved) . 
synw (limited, ven) . 
synw (limiting, n) . 
synw (limiting, ving) . 
synw (limits , vp) . 
synw (link, n) . 
synw (link, v) . 
synw (link, vp) . 
synw (linked, ved) . 
synw (linked, ven) . 
synw ( linking, n) . 
synw (linking, ving) . 
synw (links, vp) . 
synw (mediate, v) . 
synw (mediate, vp) . 
synw (mediated, ved) . 
synw (mediated, ven) . 
synw (mediates, vp) . 
synw (mediating, n) . 
synw (mediating, ving) . 
synw (mediation, n) . 
synw(methylate, vp).
synw(methylate, v).
synw(methylated, ved).
synw(methylated, ven).
synw(methylates, vp).
synw(methylating, n).
synw(methylating, ving).
synw(methylation, n).
synw(modification, n).
synw(modified, ved).
synw(modified, ven).

synw (modifies, vp) . 

synw (modify, v) . 

synw (modify, vp) . 

synw (modifying, n) . 

synw (modifying, ving) . 

synw (mutate , v) . 

synw (mutate , vp) . 

synw (mutated, ved) . 

synw (mutated, ven) . 

synw (mutates ,vp) . 

synw (mutating, n) . 

synw (mutating, ving) . 

synw (mutation, n) . 

synw (overexpress , v) . 

synw (overexpress , vp) . 

synw (overexpressed, ved) . 

synw (overexpressed, ven) . 

synw (overexpresses , vp) . 

synw (overexpressing, n) . 

synw(overexpressing, ving).

synw (overexpression, n) . 

synw (pair , v) . 

synw (pair , vp) . 

synw (paired, ved) . 

synw (paired, ven) . 

synw (pairing, n) . 

synw (pairing, ving) . 

synw (pairs , vp) . 

synw(phosphorylate, v).

synw(phosphorylate, vp) . 

synw (phosphorylated, ved) . 

synw (phosphorylated, ven) . 

synw (phosphorylates , vp) . 

synw (phosphorylating, n) . 

synw (phosphorylating, ving) . 

synw (phosphorylation, n) . 

synw (promote, v) . 

synw (promote, vp) . 

synw (promoted, ved) . 

synw (promoted, ven) . 

synw (promotes , vp) . 

synw (promoting, n) .
synw (promoting, ving) .
synw (promotion, n) . 
synw (prompt , n) . 
synw (prompt , v) . 
synw (prompt ,vp) . 
synw (prompted, ved) . 
synw (prompted, ven) . 
synw (prompting, n) . 
synw (prompting, ving) . 
synw(prompts, vp).
synw (react , v) . 
synw (react ,vp) . 
synw (reacted, ved) . 
synw (reacted, ven) . 
synw (reacting, n) . 

synw (reacting, ving) . 

synw (reaction, n) . 

synw (reacts , vp) . 

synw (regulate ,v) . 

synw (regulate , vp) . 

synw (regulated, ved) . 

synw (regulated, ven) . 

synw (regulates ,vp) . 

synw (regulating, n) . 

synw (regulating, ving) . 

synw (regulation, n) . 

synw (release , n) . 

synw (release, v) . 

synw (release, vp) . 

synw (released, ved) . 

synw (released, ven) . 

synw (releases, vp) . 

synw (releasing, n) . 

synw (releasing, ving) . 

synw (removal , n) . 

synw (remove, v) . 

synw (remove, vp) . 

synw (removed, ved) . 

synw (removed, ven) . 

synw (removes, vp) . 

synw (removing, n) . 

synw (removing, ving) . 
synw (replace, v) .
synw (replace, vp) .
synw (replaced, ved) . 
synw (replaced, ven) . 
synw (replacement , n) . 
synw (replaces, vp) . 
synw (replacing, n) . 
synw (replacing, ving) . 
synw (repress , vp) . 
synw (repress , v) . 
synw (repressed, ved) . 
synw (repressed, ven) . 
synw (represses, vp) . 
synw (repressing, n) . 
synw (repressing, ving) . 
synw (repression, n) . 
synw ( require, v) . 
synw (require, vp) . 
synw (required, ved) . 
synw (required, ven) . 
synw (requirement , n) . 
synw (requires, vp) . 
synw (requiring, n) . 
synw (requiring, ving) . 
synw (restrain, vp) . 
synw (restrain, v) . 
synw (restrained, ved) . 
synw (restrained, ven) . 
synw (restraining, n) . 
synw (restraining, ving) . 
synw (restrains, vp) . 
synw (restraint , n) . 
synw (sensitization, n) . 
synw (sensitize, vp) . 
synw (sensitize, v) . 
synw (sensitized, ved) . 
synw (sensitized, ven) . 
synw (sensitizes, vp) . 
synw (sensitizing, n) . 
synw (sensitizing, ving) . 
synw (separate, v) . 
synw (separate, vp) . 
synw (separated, ved) . 
synw (separated, ven) . 
synw (separates, vp) .
synw (separating, n) . 
synw (separating, ving) . 
synw (separation, n) . 
synw (sever, v) . 
synw ( sever, vp) . 
synw (severance , n) . 
synw (severed, ved) . 
synw (severed, ven) . 
synw (severing, n) . 
synw (severing, ving) . 
synw (severs , vp) . 
synw (signal , v) . 
synw (signal , vp) . 
synw (signaled, ved) . 
synw (signaled, ved) . 
synw (signaled, ven) . 
synw (signaling, n) . 
synw (signaling, ving) . 
synw (signals, vp) . 
synw (split , n) . 
synw (split , v) . 
synw (split , ved) . 
synw (split , ven) . 
synw (split , vp) . 
synw (splits, vp) . 
synw (splitting, n) . 
synw(splitting, ving).
synw (stimulate, v) . 
synw (stimulate, vp) . 
synw (stimulated, ved) . 
synw (stimulated, ven) . 
synw (stimulates , vp) . 
synw (stimulating, n) . 
synw (stimulating, ving) . 
synw (stimulation, n) . 
synw (substitute, v) . 
synw (substitute, vp) . 
synw (substituted, ved) . 
synw (substituted, ven) . 
synw (substitutes, vp) . 
synw (substituting, n) . 
synw (substituting, ving) . 
synw(substitution, n).
synw(suppress, vp).
synw (suppress, v) . 
synw (suppressed, ved) . 
synw (suppressed, ven) . 
synw (suppresses, vp) . 
synw (suppressing, n) . 
synw (suppressing, ving) . 
synw (suppression, n ). 
synw (tie, n) . 
synw ( tie, v) . 
synw (tie , vp) . 
synw (tied, ved) . 
synw (tied, ven) . 
synw (ties ,vp) . 
synw (transcribe , v) . 
synw (transcribe, vp) . 
synw (transcribed, ved) . 
synw (transcribed, ven) . 
synw (transcribes, vp) . 
synw (transcribing, n) . 
synw ( transcribing, ving) . 
synw (transcription, n) . 
synw (tying, n) . 
synw (tying, ving) . 
synw (ubiquitinization, n) . 
synw (ubiquitinize, v) . 
synw (ubiquitinize, vp) . 
synw (ubiquitinized, ved) . 
synw (ubiquitinized, ven) . 
synw(ubiquitinizes, vp) . 
synw(ubiquitinizing, n).
synw(ubiquitinizing, ving).
synw (urge, n) . 
synw (urge, v) . 
synw(urge, vp).
synw (urged, ved) . 
synw (urged, ven) . 
synw (urges ,vp) . 
synw (urging , n) . 
synw (urging, ving) . 

% the following are verbs connected with complexes 
synw (form, v) . 
synw (form, vp) .
synw (forms, vp) . 
synw (formed, ved) . 
synw (formed, ven) . 
synw (forming, n) . 
synw (formation, n) . 
synw (assemble, v) . 
synw (assemble , vp) . 
synw (assembles , vp) . 
synw (assembled, ved) . 
synw (assembled, ven) . 
synw (assembling, n) . 
synw (assembly , n) . 
synw (dissassemble, v) . 
synw (dissassemble, vp) . 
synw (dissassembles, vp) . 
synw (dissassembled, ved) . 
synw (dissassembled, ven) . 
synw (dissassembling, n) . 
synw (dissassembly, n) . 
synw (dissociate, v) . 
synw (dissociate, vp) . 
synw (dissociates , vp) . 
synw (dissociated, ved) . 
synw (dissociated, ven) . 
synw (dissociating, n) . 
synw (dissociation, n) . 
synw (recruit , v) . 
synw (recruit , vp) . 
synw (recruits, vp) . 
synw (recruited, ved) . 
synw (recruited, ven) . 
synw (recruiting, n) . 
synw (recruitment , n) . 
% lexsemact.pat

% revised March 17, 2000 

% SEMANTIC LEXICON OF ACTIONS 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

%%%%%%% 

% For genomics - the grammar tests for semantic and syntactic categories
% separately for action type of categories; for substances the lexical
% entries are the same as in the medical area
% action type phrases have two entries: a semantic entry and a syntactic entry
% This lexicon contains the semantic entries for words and phrases

% semp is a lexical entry for phrasal lexicon
% semp(+Word1, +Sem, +Wordlist, +Targetform, +Features)
% semp specifies a semantic lexical definition for the genomics literature
% semp is equivalent to the predicate "phrase" in the medical area
% semp: Word1 is first word of phrase, Sem is semantic category
% semp: Wordlist is list of words in phrase, Targetform is output form
% semp: Features is a list of 2 elements or the atom "def" representing default
% semp: Features 1st element is rev or nrev meaning reversed or not reversed
% semp: Features 2nd element is a # specifying number of arguments for action
% semp: Features = def is equivalent to a list - [nrev, 2]
% in case action has 1 argument, use [1,_]

% semw is a lexical entry for single word
% semw(+Word, +Sem, +Targetform, +Features)
% semw: the arguments are the same as for semp except there is no Wordlist

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

%%%%%%%% 
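
% The Features argument described above can be illustrated with a small
% sketch. This is not part of the original lexicon: normalize_features/3,
% order_args/5, sample_semp/5 and phrase_frame/4 are hypothetical names, and
% the frame layout [action, Target, Agent, Object] is a simplified stand-in
% for the frames the grammar builds. The sketch assumes the conventions seen
% in the entries of this file: [def] keeps the surface order of the two
% arguments, and [2, rev] swaps them.

```prolog
% Hypothetical helper: map a Features value to a direction and an arity.
normalize_features([def], nrev, 2).
normalize_features(def, nrev, 2).
normalize_features([2, rev], rev, 2).
normalize_features([1], nrev, 1).

% Hypothetical helper: put the two surface arguments in agent/object order.
order_args(nrev, Left, Right, Left, Right).
order_args(rev, Left, Right, Right, Left).

% Two phrasal entries copied from this lexicon, renamed so the sketch
% does not clash with the real semp/5 facts.
sample_semp(due, cause, [due, to], cause, [2, rev]).
sample_semp(result, cause, [result, in], cause, [def]).

% phrase_frame(+Words, +Left, +Right, -Frame)
% Left and Right are the arguments found before and after the phrase.
phrase_frame(Words, Left, Right, [action, Target, Agent, Object]) :-
    sample_semp(_, _, Words, Target, Features),
    normalize_features(Features, Direction, 2),
    order_args(Direction, Left, Right, Agent, Object).
```

% With these definitions, "b [is] due to a" and "a results in b" produce the
% same frame, because the reversed entry swaps its arguments:
%   ?- phrase_frame([due, to], b, a, F).     gives F = [action, cause, a, b]
%   ?- phrase_frame([result, in], a, b, F).  gives F = [action, cause, a, b]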

:- multifile(semp/5).
:- multifile(semw/4).

semp (account , cause, [account , for] , cause, [def]) . 
semp (accounted, cause, [accounted, for] , cause, [def]) . 
semp(accounting, cause, [accounting, for], cause, [def]).
semp (accounts, cause, [accounts, for] , cause, [def]) . 
semp(add, attach, [add, up], attach, [def]). 
semp (added, attach, [added, up], attach, [def]). 
semp (adds, attach, [adds, up], attach, [def]). 
semp (are, cause, [are, a, means, of , producing] , cause, [def]) . 
semp (are, cause, [are, due, to] , cause, [2 , rev] ) . 
semp (as , cause, [as , a, result , of ] , cause, [2 , rev] ) . 
semp (attributable, cause, [attributable, to] , cause, [2, rev]) . 
semp (attributed, cause, [attributed, to] , cause, [2, rev]) . 
semp (based, cause, [based, on] , cause, [2, rev]) . 
semp (based, cause, [based, upon] , cause, [2, rev] ) . 
semp (because, cause, [because, of] , cause, [2, rev]) . 
semp(convey, signal, [convey, a, signal], signal, [def]).
semp (conveyed, signal, [conveyed, a, signal], signal, [def]). 
semp (conveying, signal, [conveying, a, signal], signal, [def]). 
semp (conveys, signal, [conveys, a, signal] , signal , [def]). 
semp (dissociate, release, [dissociate, from], release, [def ]) . 
semp (dissociated, release, [dissociated, from] , release, [def] ) . 
semp (dissociates, release, [dissociates, from] , release, [def]) . 
semp(dissociation, release, [dissociation, from], release, [def]).
semp(down, signal, [down, '-', regulate], signal, [def]).      % A down-regulates B   A --> B
semp(down, signal, [down, '-', regulated], signal, [def]).     % A down-regulates B   A --> B
semp(down, signal, [down, '-', regulates], signal, [def]).     % A down-regulates B   A --> B
semp(down, signal, [down, '-', regulation], signal, [def]).    % A down-regulates B   A --> B

semp (due, cause, [due, to, the, fact , that] , cause, [2, rev] ) . 
semp (due, cause, [due, to] , cause , [2,rev]). 
semp (form, attach, [form, complex], attach, [def]). 
semp(formation, attach, [formation, of, complex], attach, [def]).
semp(formed, attach, [formed, complex], attach, [def]).
semp(forms, attach, [forms, complex], attach, [def]).
semp(had, cause, [had, an, active, role, in], cause, [def]).
semp(has, cause, [has, an, active, role, in], cause, [def]).
semp(have, cause, [have, an, active, role, in], cause, [def]).
semp(is, cause, [is, a, means, of, producing], cause, [def]).
semp(is, cause, [is, due, to], cause, [2, rev]).
semp(functions, inactivate, [functions, as, a, negative, regulator, of], inactivate, [def]).
semp(function, inactivate, [function, as, a, negative, regulator, of], inactivate, [def]).

semp(lead, cause, [lead, to], cause, [def]).
semp (lead, causel , [lead, to] , cause, [def] ) . 
semp (leading, cause, [leading, to] , cause, [def]) . 
semp (leading, cause, [leading, to] , cause, [def]) . 
semp (leads, cause, [leads, to], cause, [def ]) . 
semp (leads, causel, [leads, to] , cause, [def] ) . 
semp (led, cause, [led, to] , cause, [def] ) . 

semp (may, cause, [may , be , responsible , for] , cause, [def]) . 
semp(mediate, signal, [mediate, a, signal], signal, [def]).          % A mediates a signal to B
semp(mediated, signal, [mediated, a, signal], signal, [def]).        % A mediates a signal to B
semp(mediates, signal, [mediates, a, signal], signal, [def]).        % A mediates a signal to B
semp(mediation, signal, [mediation, of, a, signal], signal, [def]).  % A mediates a signal to B



semp(n, createbond, [n, '-', acetylate], 'N-acetylate', [def]).
semp(n, createbond, [n, '-', acetylated], 'N-acetylate', [def]).
semp(n, createbond, [n, '-', acetylates], 'N-acetylate', [def]).
semp(n, createbond, [n, '-', acetylation], 'N-acetylate', [def]).
semp(n, createbond, [n, '-', acylate], 'N-acylate', [def]).
semp(n, createbond, [n, '-', acylated], 'N-acylate', [def]).
semp(n, createbond, [n, '-', acylates], 'N-acylate', [def]).
semp(n, createbond, [n, '-', acylation], 'N-acylate', [def]).
semp(n, createbond, [n, '-', glycosylate], 'N-glycosylate', [def]).
semp(n, createbond, [n, '-', glycosylated], 'N-glycosylate', [def]).
semp(n, createbond, [n, '-', glycosylates], 'N-glycosylate', [def]).
semp(n, createbond, [n, '-', glycosylation], 'N-glycosylate', [def]).
semp(n, breakbond, [n, '-', terminal, proteolysis], 'n-terminal proteolysis', [def]).
semp(o, createbond, [o, '-', glycosylate], 'O-glycosylate', [def]).
semp(o, createbond, [o, '-', glycosylated], 'O-glycosylate', [def]).
semp(o, createbond, [o, '-', glycosylates], 'O-glycosylate', [def]).
semp(o, createbond, [o, '-', glycosylation], 'O-glycosylate', [def]).
semp(only, time, [only, after], 'only after', [2, rev]).
semp(prolyl, createbond, [prolyl, '-', 4, '-', hydroxylate], 'prolyl-4-hydroxylate', [def]).
semp(prolyl, createbond, [prolyl, '-', 4, '-', hydroxylated], 'prolyl-4-hydroxylate', [def]).
semp(prolyl, createbond, [prolyl, '-', 4, '-', hydroxylates], 'prolyl-4-hydroxylate', [def]).
semp(prolyl, createbond, [prolyl, '-', 4, '-', hydroxylation], 'prolyl-4-hydroxylate', [def]).
semp (result , cause, [result , from] , cause, [2, rev]) . 
semp (result, cause, [result, in] , cause, [def] ) . 
semp (resulted, cause, [resulted, from] , cause, [2, rev] ) . 
semp (resulted, cause, [resulted, in] , cause, [def]) . 
semp (resulting, cause, [resulting, from] , cause, [2, rev]) . 
semp (resulting, cause, [resulting, in] , cause, [def]) . 
semp (results, cause, [results , from] , cause , [2, rev]) . 
semp (results, cause, [results, in] , cause, [def]). 
semp(set, release, [set, free], release , [def ] ) . 
semp(set, release, [set, free], release , [def ] ) . 
semp(sets, release, [sets, free], release , [def ] ) . 
semp (setting, release, [setting, free], release , [def ] ) . 
semp(suppress, inactivate, [suppress, activity, of], inactivate, [def]).
semp(suppressed, inactivate, [suppressed, activity, of], inactivate, [def]).
semp(suppresses, inactivate, [suppresses, activity, of], inactivate, [def]).
semp(suppression, inactivate, [suppression, of, activity, of], inactivate, [def]).
semp(switch, activate, [switch, on, the, activity, of], activate, [def]).
semp(switched, activate, [switched, on, the, activity, of], activate, [def]).
semp(switches, activate, [switches, on, the, activity, of], activate, [def]).
semp(up, signal, [up, regulate], signal, [2, rev]).          % A up-regulates B   B --> A
semp(up, signal, [up, '-', regulated], signal, [2, rev]).
semp(up, signal, [up, '-', regulates], signal, [2, rev]).
semp(up, signal, [up, '-', regulation], signal, [2, rev]).
semp(was, cause, [was, a, means, of, producing], cause, [def]).
semp(was, cause, [was, due, to], cause, [2, rev]).
semp(were, cause, [were, a, means, of, producing], cause, [def]).
semp(were, cause, [were, due, to], cause, [2, rev]).

semw (acetylate, createbond, acetylate, [def]) . 

semw (acetylated, createbond, acetylate, [def]) . 

semw (acetylates , createbond, acetylate, [def]) . 

semw (acetylation, createbond, acetylate, [def]) . 

semw (activate , activate, activate, [def]) . 

semw (activated, activate, activate, [def]) . 

semw (activates, activate, activate, [def] ) . 
semw(activation, activate, activate, [def]).
semw (add, attach, attach, [def] ) . 
semw (added, attach, attach, [def] ) . 
semw (addition, attach, attach, [def]) . 
semw(adds, attach, attach, [def] ) . 

semw (after , time, after, [2, rev]) . % temporal relations 

semw (aggregate , attach, attach, [def]) . 

semw (aggregated , attach, attach, [def]) . 

semw (aggregates, attach, attach, [def]) . 

semw (aggregation , attach, attach, [def] ) . 

semw(arrest, inactivate, inactivate, [def]) . 

semw (arrested, inactivate, inactivate, [def]) . 

semw (arrests , inactivate, inactivate, [def]) . 

semw (associate, attach, attach, [def]) . 

semw (associated, attach, attach, [def]) . 

semw (associates , attach, attach, [def]) . 

semw (association, attach, attach, [def]) . 

semw (attach, attach, attach, [def]) . 

semw (attached , attach, attach, [def]) . 

semw (attaches, attach, attach, [def]) . 

semw (attachment , attach, attach, [def]) . 

semw (bind, attach, attach, [def]) . 

semw (binding, attach, attach, [def]) . 

semw (binds, attach, attach, [def]) . 

semw (block, inactivate, inactivate, [def]) . 

semw (blocked, inactivate, inactivate, [def]) . 

semw (blocking, inactivate, inactivate, [def]) . 

semw (blocks, inactivate, inactivate, [def]) . 

semw (bound, attach, attach, [def]) . 

semw (break, breakbond, 'break bond' , [def] ) . 

semw (breakage, breakbond, 'break bond', [def]) . 

semw(breaks, breakbond, 'break bond', [def]) . 

semw (broke, breakbond, 'break bond' , [def] ) . 

semw(broken, breakbond, 'break bond', [def]).   % case without break bond

semw (catalyzation, promote, catalyze, [def]) . 
semw (catalyze, promote, catalyze, [def] ) . 
semw (catalyzed, promote, catalyze, [def]) . 
semw (catalyzes, promote, catalyze, [def]) . 
semw (catalyzing, promote, catalyze, [def]) . 
semw (cause, cause, cause, [def] ) . 
semw(caused, cause, cause, [def]) . 
semw (causes, cause, cause, [def] ) . 
semw(cleavage, breakbond, 'break bond', [def]).

semw(cleave, breakbond, 'break bond', [def]).

semw(cleaved, breakbond, 'break bond', [def]).

semw (cleaves, breakbond, 'break bond' , [def] ) . 

semw (coimmunoprecipitate, attach, attach, [def]) . 

semw (coimmunoprecipitated , attach, attach, [def]) . 

semw(coimmunoprecipitates, attach, attach, [def]) . 

semw (coimmunoprecipitation , attach, attach, [def]) . 

semw (combination , attach, attach, [def]) . 

semw (combine , attach, attach, [def] ) . 

semw (combined , attach, attach, [def]) . 

semw (combines, attach, attach, [def]) . 

semw (conjugate , attach, attach, [def]) . 

semw (conjugated , attach, attach, [def]) . 

semw (conjugates, attach, attach, [def] ) . 

semw (conjugation , attach, attach, [def]) . 

semw (connect , attach, attach, [def] ) . 

semw (connected , attach, attach, [def]) . 

semw (connection , attach, attach, [def]) . 

semw (connects, attach, attach, [def]) . 

semw (constrain, inactivate, inactivate, [def]) . 

semw (constrained, inactivate, inactivate, [def]) . 

semw (constrains, inactivate, inactivate, [def]) . 

semw (constraint , inactivate, inactivate, [def]) . 

semw (coprecipitate, attach, attach, [def]) . 

semw (coprecipitated, attach, attach, [def]) . 

semw (coprecipitates, attach, attach, [def]) . 

semw (coprecipitation , attach, attach, [def]) . 

semw(copurification, attach, attach, [def]).

semw(copurified, attach, attach, [def]).

semw(copurifies, attach, attach, [def]).

semw(copurify, attach, attach, [def]).

semw (couple , attach, attach, [def]) . 

semw (coupled, attach, attach, [def]) . 

semw (couples , attach, attach, [def]) . 

semw(cut, breakbond, 'break bond', [def]).   % leave breakbond only?

semw(cuts, breakbond, 'break bond', [def]).
semw (deactivate, inactivate, inactivate, [def]) . 
semw (deactivated, inactivate, inactivate, [def]) . 
semw (deactivates , inactivate, inactivate, [def]) . 
semw (deactivation, inactivate, inactivate, [def]) . 
semw(death, process, death, [1]) . 






semw (demethylate, breakbond, demethylate, [def ] ) . 

semw(demethylated, breakbond, demethylate, [def]) . 

semw (demethylates, breakbond, demethylate, [def]) . 

semw (demethylation, breakbond, demethylate, [def]) . 

semw(dephosphorylate, breakbond, dephosphorylate, [def]).

semw(dephosphorylated, breakbond, dephosphorylate, [def]).

semw(dephosphorylates, breakbond, dephosphorylate, [def]).

semw (dephosphorylation, breakbond, dephosphorylate , [def] ) . 

semw(die, process, death, [1]) . 

semw(died, process, death, [1]) . 

semw(dies, process, death, [1]) . 

semw (disassemble, release, release, [def]) . 

semw (disassembled, release, release, [def]) . 

semw (disassembles, release, release, [def]) . 

semw (disassembly, release, release, [def]) . 

semw (discharge, release, release, [def]) . 

semw (discharged, release, release, [def]) . 

semw (discharges, release, release, [def]) . 

semw (disengage, release, release, [def]) . 

semw (disengaged, release, release, [def]) . 

semw (disengagement , release, release, [def]) . 

semw (disengages, release, release, [def]) . 

semw(divide, breakbond, 'break bond', [def]).

semw (divided, breakbond, 'break bond' , [def] ) . 

semw (divides, breakbond, 'break bond' , [def] ) . 

semw (division, breakbond, 'break bond' , [def] ) . 

semw(dying, process, death, [1]).

semw (enhance, promote, promote, [def]) . 

semw (enhanced, promote, promote, [def]) . 

semw (enhancement , promote, promote, [def]) . 

semw (enhances, promote, promote, [def]) . 

semw (enhancing, promote, promote, [def]) . 

semw(express, generate, express, [def]).   % can have either 1 or 2 arguments

semw (expressed, generate , express , [def]) . 
semw (expresses, generate, express, [def]) . 
semw (expressing, generate, express, [def] ) . 
semw (expression, generate, express, [def]) . 
semw (generate, generate, generate, [def]) . 
semw (generated, generate , generate , [def]) . 
semw (generates , generate, generate, [def] ) . 
semw (generating, generate, generate, [def]) . 
semw (generation, generate, generate, [def]) . 
semw(hew, breakbond, 'break bond', [def]).
semw(hewed, breakbond, 'break bond', [def]).
semw(hews, breakbond, 'break bond', [def]).
semw(hinder, inactivate, inactivate, [def]).
semw (hindered, inactivate, inactivate, [def]) . 
semw (hinders, inactivate, inactivate, [def]) . 
semw (hindrance, inactivate, inactivate, [def]) . 
semw (inactivate, inactivate, inactivate, [def] ) . 
semw (inactivated, inactivate, inactivate, [def]) . 
semw (inactivates, inactivate, inactivate, [def]) . 
semw(inactivation, inactivate, inactivate, [def]).
semw(incite, activate, activate, [def]) . 
semw (incited, activate, activate, [def]) . 
semw (incitement , activate, activate, [def] ) . 
semw (incites, activate, activate, [def] ) . 
semw(induce, activate, activate, [def]) . 
semw (induced, activate, activate, [def]) . 
semw (induces, activate, activate, [def]) . 
semw (induction, activate, activate, [def]) . 
semw (influence, activate, activate, [def]) . 
semw (influenced, activate, activate, [def]) . 
semw (influences , activate, activate, [def]) . 
semw (influencing, activate, activate, [def]) . 
semw (inhibit , inactivate, inactivate, [def] ) . 
semw (inhibited, inactivate, inactivate, [def]) . 
semw (inhibition, inactivate, inactivate, [def] ) . 
semw (inhibits, inactivate, inactivate, [def]) . 
semw (initiate , activate, activate, [def]) . 
semw (initiated, activate, activate, [def]) . 
semw (initiates, activate, activate, [def]) . 
semw(initiation, activate, activate, [def]).
semw (instigate, activate, activate, [def]) . 
semw (instigated, activate, activate, [def]) . 
semw (instigates , activate, activate, [def]) . 
semw (instigation, activate, activate, [def]) . 
semw (interact , interact, interact, [def]) . 
semw (interacted, interact, interact, [def]) . 
semw (interaction, interact, interact, [def]) . 
semw (interactions , interact, interact, [def]) . 
semw(interacts, interact, interact, [def]).
semw (join , attach, attach, [def] ) . 
semw(joined , attach, attach, [def]) . 
semw (joining, attach, attach, [def]) . 
semw(joins, attach, attach, [def]).

semw (juncture, attach, attach, [def ] ) . 

semw (liberate, release, release, [def] ) . 

semw (liberated, release, release, [def]) . 

semw (liberates , release, release, [def] ) . 

semw (liberation, release, release, [def]) . 

semw (limit, inactivate, inactivate, [def]) . 

semw (limitation, inactivate, inactivate, [def]) . 

semw (limited, inactivate, inactivate, [def ]) . 

semw(limits, inactivate, inactivate, [def]) . 

semw (link, attach, attach, [def] ) . 

semw (linked, attach, attach, [def]) . 

semw (linking, attach, attach, [def] ) . 

semw (links, attach, attach, [def]) . 

semw (mediate, promote, promote, [def]) . 

semw (mediated, promote, promote, [def]) . 

semw (mediates, promote, promote, [def] ) . 

semw (mediation, promote, promote, [def]) . 

semw (methylate, createbond, methylate, [def]) . 

semw (methylated, createbond, methylate, [def] ) . 

semw (methylates, createbond, methylate, [def]) . 

semw (methylation, createbond, methylate, [def]) . 

semw (modification, modify, modify, [def]) . 

semw (modified, modify, modify, [def]) . 

semw (modifies, modify, modify, [def]) . 

semw (modify, modify, modify, [def]) . 

semw (modifying, modify, modify, [def] ) . 

semw (mutate, modify, mutate, [1]) . 

semw (mutated, modify, mutate, [1]) . 

semw (mutates, modify, mutate, [1]) . 

semw (mutating, modify , mutate , [1]) . 

semw (mutation, modify, mutate, [1]) . 

semw (overexpressed, generate, overexpress, [def]) . 

semw (overexpresses, generate, overexpress, [def]) . 

semw (overexpressing, generate, overexpress, [def] ) . 

semw (overexpress, generate, express, [def]) . 

semw (overexpression, generate, overexpress, [def]) . 

semw (pair , attach, attach, [def] ) . 

semw (paired, attach, attach, [def]) . 

semw (pairing, attach, attach, [def]) . 

semw (pairs , attach, attach, [def]) . 

semw (phosphorylate, createbond, phosphorylate, [def]) . 
semw (phosphorylated, createbond, phosphorylate, [def]) . 
semw(phosphorylates, createbond, phosphorylate, [def]).

semw (phosphorylation, createbond, phosphorylate, [def]) . 

semw (precede, cause, cause, [def]) . 

semw (preceded, cause, cause, [def]). 

semw (precedes , cause, cause, [def]). 

semw (preceding, cause, cause, [def]). 

semw (promote, promote, promote, [def]) . 

semw (promoted, promote, promote, [def] ) . 

semw (promotes, promote, promote, [def] ) . 

semw (promotion, promote, promote, [def]) . 

semw(prompt, activate, activate, [def]) . 

semw (prompted, activate, activate, [def]) . 

semw (prompting, activate, activate, [def]) . 

semw (prompts, activate, activate, [def]) . 

semw (react, react, react , [def ]) . 

semw (reacted, react, react, [def ]) . 

semw (reaction, react, react, [def]) . 

semw (reactions, react, react, [def]) . 

semw (reacts, react, react, [def] ) . 

semw (regulate, signal, signal, [def]) . 

semw(regulated, signal, signal, [def]).   % B is regulated by A   A --> B

semw (regulates, signal, signal, [def]) . 
semw (regulation, signal, signal, [def]) . 
semw (release, release, release, [def]) . 
semw (released, release, release, [def]) . 
semw (releases, release, release, [def]) . 
semw(removal, breakbond, 'break bond', [def]).
semw(remove, breakbond, 'break bond', [def]).
semw(remove, breakbond, 'break bond', [def]).
semw(removes, breakbond, 'break bond', [def]).
semw (replace, substitute, substitute, [def]) . 
semw (replaced, substitute, substitute, [def] ) . 
semw (replacement , substitute, substitute, [def] ) . 
semw (replaces, substitute, substitute, [def]) . 
semw (repress, inactivate, inactivate, [def]) . 
semw (repressed, inactivate, inactivate, [def]) . 
semw (represses, inactivate, inactivate, [def] ) . 
semw (repression, inactivate, inactivate, [def]) . 
semw (require, cause, cause, [2, rev] ) . 
semw (required, cause, cause, [2, rev] ) . 
semw (requirement , cause, cause, [2, rev]) . 
semw (requires , cause, cause, [2, rev] ) . 
semw(requiring, cause, cause, [2, rev]).
semw(restrain, inactivate, inactivate, [def]).
semw(restrained, inactivate, inactivate, [def]).
semw(restrains, inactivate, inactivate, [def]).
semw(restraint, inactivate, inactivate, [def]).
semw(sensitization, activate, activate, [def]).
semw(sensitize, activate, activate, [def]).
semw(sensitized, activate, activate, [def]).
semw(sensitizes, activate, activate, [def]).
semw(separate, breakbond, 'break bond', [def]).
semw(separated, breakbond, 'break bond', [def]).
semw(separates, breakbond, 'break bond', [def]).
semw(separation, breakbond, 'break bond', [def]).
semw(sever, breakbond, 'break bond', [def]).
semw(severance, breakbond, 'break bond', [def]).
semw(severed, breakbond, 'break bond', [def]).
semw(severs, breakbond, 'break bond', [def]).
semw(signal, signal, signal, [def]).
semw(signaled, signal, signal, [def]).
semw(signaling, signal, signal, [def]).
semw(signals, signal, signal, [def]).
semw(split, breakbond, 'break bond', [def]).
semw(splits, breakbond, 'break bond', [def]).
semw(splitting, breakbond, 'break bond', [def]).
semw(stimulate, activate, activate, [def]).
semw(stimulated, activate, activate, [def]).
semw(stimulates, activate, activate, [def]).
semw(stimulation, activate, activate, [def]).
semw(substitute, substitute, substitute, [def]).
semw(substituted, substitute, substitute, [def]).
semw(substitutes, substitute, substitute, [def]).
semw(substitution, substitute, substitute, [def]).
semw(suppress, inactivate, inactivate, [def]).
semw(suppressed, inactivate, inactivate, [def]).
semw(suppresses, inactivate, inactivate, [def]).
semw(suppression, inactivate, inactivate, [def]).
semw(tie, attach, attach, [def]).
semw(tied, attach, attach, [def]).
semw(ties, attach, attach, [def]).
semw(transcribe, generate, transcribe, [def]).
semw(transcribed, generate, transcribe, [def]).
semw(transcribes, generate, transcribe, [def]).
semw(transcribing, generate, transcribe, [def]).
semw(transcription, generate, transcribe, [def]).
semw(ubiquitinize, createbond, ubiquitinize, [def]).
semw(ubiquitinize, createbond, ubiquitinize, [def]).
semw(ubiquitinized, createbond, ubiquitinize, [def]).
semw(ubiquitinizes, createbond, ubiquitinize, [def]).
semw(urge, activate, activate, [def]).
semw(urge, activate, activate, [def]).
semw(urged, activate, activate, [def]).
semw(urges, activate, activate, [def]).
semw(urging, activate, activate, [def]).
semw(form, attach, attach, [def]).
semw(forms, attach, attach, [def]).
semw(formed, attach, attach, [def]).
semw(forming, attach, attach, [def]).
semw(formation, attach, attach, [def]).
semw(assemble, attach, attach, [def]).
semw(assembles, attach, attach, [def]).
semw(assembled, attach, attach, [def]).
semw(assembling, attach, attach, [def]).
semw(assembly, attach, attach, [def]).
semw(dissassemble, release, release, [def]).
semw(dissassembles, release, release, [def]).
semw(dissassembled, release, release, [def]).
semw(dissassembling, release, release, [def]).
semw(dissassembly, release, release, [def]).
semw(dissociate, release, release, [def]).
semw(dissociates, release, release, [def]).
semw(dissociated, release, release, [def]).
semw(dissociating, release, release, [def]).
semw(dissociation, release, release, [def]).
semw(recruit, attach, attach, [def]).
semw(recruits, attach, attach, [def]).
semw(recruited, attach, attach, [def]).
semw(recruiting, attach, attach, [def]).
semw(recruitment, attach, attach, [def]).
% edited Genome grammar - adapted from MedLEE's grammar for use with MedLEE

% this is to be used along with the genomics lexicon of substances, actions, 

% and relations. 

% revised March 16, April 5, 2000

% adjusted for tagged input 

:- multifile(wdef/3).
:- multifile(phrase/5).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%% Semantic Grammar for Genomics %%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Written by Carol Friedman for the MedLEE System
%
% Queens College of the City University of New York
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Highest Level Predicate - sem_sent - 1st arg. is target structure
%                                    - 2nd arg. is a list of words in sentences
%                                    - 3rd arg. is '[]'
% Target structure: a frame or set of connected frames:
%    the frame describes an action or several related actions;
%    an action frame is a list consisting of the symbol 'action'
%    followed by the code for the action and arguments.
%    The arguments are either substances or actions;
%    each substance slot consists of the name of the type of
%    substance followed by the value for the substance;
%    the substance slot may contain slots for several substances.
%
% Examples:
% Blocking of il-2 gene transcription by activated rap1.
%    [action, inactivate, [protein, Rap1, [state, active]],
%       [action, transcribe, [x], [gene, interleukin-2]]]
%
% The adapter protein crkl was associated with both phosphorylated cbl and the
% guanidine nucleotide-releasing factor c3g.
%    [action, attach, [protein, CrkL],
%       [relation, and, [protein, Cbl, [state, phosphorylated]],
%          [protein, guanidine nucleotide-releasing factor C3G,
%             [state, phosphorylated]]]]
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% fail an unknown predicate 

:- unknown(_, fail).

:- op(900, fy, [not, once]).    % same priority and type as \+

:- op(700, xfx, [\=, ~=]).      % same priority and type as = or ==
% snoop is generally used to find input string when using a DCG 

% the input string is used for constraints 

snoop (A, B, A, B) . 
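
% The way snoop works inside a DCG rule can be shown with a short sketch.
% Only snoop/4 comes from the listing (it is repeated here so that the
% example loads on its own); one_word//3 is a hypothetical rule added for
% illustration. When snoop(X, X) is called as a nonterminal, the DCG
% translation appends the current and remaining input lists as two extra
% arguments, so X is unified with the unread input and nothing is consumed.

```prolog
% Same definition as in the grammar above.
snoop(A, B, A, B).

% Hypothetical rule: read exactly one word and report the input list as it
% was before and after that word.
one_word(Word, Before, After) -->
    snoop(Before, Before),   % Before = input still to be read at this point
    [Word],                  % consume one word
    snoop(After, After).     % After = input left once the word is consumed
```

% Example query:
%   ?- phrase(one_word(W, B, A), [rap1, inhibits], Rest).
%   W = rap1, B = [rap1, inhibits], A = [inhibits], Rest = [inhibits].
% The pattern rules below use the same idiom to capture S0 and S, the input
% at the start and end of a pattern, as keys for the symbol table.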



sem_sent(P, Semlist, X) -->
    {assert(addstotal(0))},
    sem_parse(P, Semlist, X).

sem_parse(Target, Semlist) -->
    sem_patterns(P, Semlist).

sem_parse(Target, Semlist, X) -->
    sem_patterns(P, Semlist),
    sem_endornot(P, Target, X).

sem_parse([failure], _, X, _, _) :-
    addstotal(X).

sem_endornot(P, P, X) -->            % P is target if there is an endmark
    sem_endmark,
    {addstotal(X)}.                  % X is number of times reached endmark
sem_endornot(_, _, _, _, _) :-       % did not reach endmark; update count and fail
    uptotal, fail.
sem_endornot(_, [failure], X, _, _) :-
    addstotal(X),                    % X is number of times reached
    X >= 50.



% Finding patterns 



sem_patterns(F, Semlist) -->
    pattern(F1, Semlist),
    {F1 \= []},                      % 1st finding should not be empty
    morepattern(R, F2, Semlist),     % connected patterns
    {getrelation(R, F1, F2, F)}.

/***************************************************************************
 * The action pattern types are: pattern, nounactionpatt, actpatt, and     *
 * nounactpatt.                                                            *
 *    pattern --> actionarg(A1)                                            *
 *                active or passive verb                                   *
 *                actionarg(A2).                                           *
 *    pattern --> nounactionpatt.                                          *
 *    pattern --> actpatt.                                                 *
 ***************************************************************************/

% pattern is saved in a symbol table (st); check for success/failure 1st
% Case where pattern is in st and has been successful
pattern(Fmt, _) --> checkst(pattern, _, s, Fmt).
% Case where pattern is in st as a failure.
pattern(_, _) --> checkst(pattern, _, f, _), {!, fail}.

% pattern 5: an action pattern with a nominal verb
% Ps1 cleavage by zvad.
% apoptosis-induced cleavage of PS2 by zDEVD.
pattern(F, Semlist) -->
    snoop(S0, S0),
    { \+ checkst(pattern, 5, _, _, S0, _),
      actionchk(Semlist) },
    nounactionpatt(F),
    snoop(S, S),
    { addst(pattern, 5, s, F, S0, S)
    }.



% pattern 1: an action/substance acts on an action/substance
% the activation of rap1 inhibits the expression of il-2
% rap1 functions as a negative regulator of tcr-mediated il-2 gene
% transcription.
pattern(F, Semlist) --> snoop(S0, S0), % S0 is the input string
    { \+ checkst(pattern, 1, _, _, S0, _),
      actionchk(Semlist),
      connectchk(Semlist) },
    actionarg(A1),
    connectact(Sem, [v,vp,ved], Target, Features),
    actionarg(A2),
    snoop(S, S), % ending sentence list
    { ( member(def, Features),
        modlist([A1,A2,Site], Mods) ;
        member(rev, Features),
        modlist([A2,A1,Site], Mods) ),
      frame(F, action, Target, Mods),
      addst(pattern, 1, s, F, S0, S)
    }.


% pattern 2: an action/substance was acted on by an action/substance
% The aggregation of bad was suppressed.
% The aggregation of bad was suppressed by the phosphorylation of jnk.
% Grb2 was associated with Cbl.
% Apoptosis-associated cleavage of endogenous PS1 was blocked by the
% treatment with zVAD.
pattern(F, Semlist) -->
    snoop(S0, S0), % S0 is the input string
    { \+ checkst(pattern, 2, _, _, S0, _),
      actionchk(Semlist),
      connectchk(Semlist) },
    actionarg(A2),
    sem_beterm(_), % was
    connectact(Sem, [ven], Target, Features), % activated
    optbyarg(A1),
    snoop(S, S), % ending sentence list
    { ( member(def, Features),
        modlist([A1,A2,Site], Mods) ;
        member(rev, Features),
        modlist([A2,A1,Site], Mods) ),
      frame(F, action, Target, Mods),
      addst(pattern, 2, s, F, S0, S)
    }.

% pattern 3: an action/substance acted on an action/substance
% bad induced phosphorylation of fyn.
% tcr and cd28-mediated il-2 transcription.
pattern(F, Semlist) -->
    snoop(S0, S0),
    { \+ checkst(pattern, 3, _, _, S0, _),
      actionchk(Semlist),
      connectchk(Semlist) },
    actionarg(A1), % substance or basic action
    % optdash,
    connectacts(Sem, [vp,ven,ved], Target, Features), % Activated
    % optof,
    actionarg(A2), % had pattern here
    snoop(S, S),
    { ( member(def, Features),
        modlist([A1,A2,Site], Mods) ;
        member(rev, Features),
        modlist([A2,A1,Site], Mods) ),
      frame(F, action, Target, Mods),
      addst(pattern, 3, s, F, S0, S)
    }.


% pattern 4: a simple action pattern with an active verb.
% Activated Raf-1 phosphorylates MEK-1.
pattern(F, Semlist) -->
    snoop(S0, S0),
    % check that sentence has an action word/phrase
    { \+ checkst(pattern, 4, _, _, S0, _),
      actionchk(Semlist) },
    actpatt(F),
    snoop(S, S),
    { addst(pattern, 4, s, F, S0, S)
    }.

% no more patterns - save failure
pattern(_, _) --> addst(pattern, 0, f, _), {!, fail}.
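
% Illustrative sketch, not part of the original listing: checkst and addst
% are defined elsewhere, so the predicates below are only an assumption
% about how such a symbol table could be kept. The names demo_st/6,
% demo_addst/6 and demo_checkst/6 are hypothetical.
:- dynamic demo_st/6.

% record that Rule (variant N) ended with Status (s or f) on input S0
demo_addst(Rule, N, Status, Result, S0, S) :-
    assertz(demo_st(Rule, N, Status, Result, S0, S)).

% look up an earlier outcome for Rule on exactly the same input S0
demo_checkst(Rule, N, Status, Result, S0, S) :-
    demo_st(Rule, N, Status, Result, Stored, S),
    Stored == S0.

% ?- demo_addst(pattern, 4, s, [action, phosphorylate], [raf, acts], []),
%    demo_checkst(pattern, _, s, F, [raf, acts], Rest).
% F = [action, phosphorylate], Rest = [].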

% sem_morepattern(-Rel,-P,+Semlist,+S0,-S):
% Rel is a relation and its value frame;
% P is the remaining patterns, Semlist is the list of semantic classes
% in sentence
% if have a series of ','s, use the relation "and" or "or" if in the nest
% and make that the relation
morepattern(R, F, Semlist) -->
    sem_relation(R1, Mod1), % relation and modifiers
    sem_patterns(F, Semlist),
    { ( frame(F, rel, Conj2, _), % F contains nested relation
        (Conj2 = and; Conj2 = or), frame(R1, rel, ',', _), % R1 relation frame
        frame(R, rel, Conj2, _) % value of relation is Conj2
      ;
        R1 \= [], % where do Type, Value and Mods2 come from?
        frame(R1, Type, Value, Mod2), % get components of original relation
        mergemods(Mod1, Mod2, Mods),
        ( Mods = [], frame(R, rel, Value, []), ! ;
          %frame(R,rel,[Value|Mods],[]) % make it rel connector with rel mod
          R = [rel, [Value|Mods]]
        )
      )
    }.

% no more findings
morepattern([], [], _, S, S).

% actionarg is the argument of pattern
% actionarg is either a substance or a basic action
% actionarg is saved in a symbol table (st); check for success/failure 1st
% Case where actionarg is in st and has been successful
actionarg(A) --> checkst(actionarg, _, s, A).
% Case where actionarg is in st as a failure.
actionarg(_) --> checkst(actionarg, _, f, _), {!, fail}.

% actionarg 1: a substance or substances
% Rap1, active Rap1, Cbl and Crkl
actionarg(A) --> snoop(S0, S0), % S0 is the input string
    { \+ checkst(actionarg, 1, _, _, S0, _) },
    substances(A),
    snoop(S, S),
    { addst(actionarg, 1, s, A, S0, S) }.
% actionarg 2: a process like apoptosis, or a disease
actionarg(A) --> snoop(S0, S0), % S0 is the input string
    { \+ checkst(actionarg, 2, _, _, S0, _) },
    processpatt(A),
    snoop(S, S),
    { addst(actionarg, 2, s, A, S0, S)
    }.

% actionarg 3: a nominal action pattern
% Etoposide-induced apoptosis.
% Etoposide-induced PS1 cleavage by zVAD.
actionarg(A) --> snoop(S0, S0), % S0 is the input string
    { \+ checkst(actionarg, 3, _, _, S0, _) },
    nounactionpatt(A),
    snoop(S, S),
    { addst(actionarg, 3, s, A, S0, S)
    }.

% actionarg 4: the object of the nominal action is an actionarg
% Blocking of IL-2 Gene transcription by activated rap1.
actionarg(A) --> snoop(S0, S0), % S0 is the input string
    { \+ checkst(actionarg, 4, _, _, S0, _) },
    action(Sem, [n,ving], Target, Features),
    [of],
    actionarg(A1),
    optbyagent(A2),
    snoop(S, S),
    { ( member(def, Features),
        modlist([A1,A2], Mods) ;
        member(rev, Features),
        modlist([A2,A1], Mods) ),
      frame(A, action, Target, Mods),
      addst(actionarg, 4, s, A, S0, S)
    }.



% no more actionarg - save failure 

actionarg (_) --> addst (actionarg, 0 , f, _) , {!, fail}. 

% nounactionpatt is a nominal action pattern which allows for left and right
% modifiers
% Il-2 gene transcription mediated by tcr and cd28 was inhibited by rap1.
% Activated rap1 functions as a negative regulator of tcr and cd-28-mediated
% il-2 transcription.

% nounactionpatt is saved in a symbol table (st); check for success/failure
% Case where nounactionpatt is in st and has been successful
nounactionpatt(A) --> checkst(nounactionpatt, _, s, A).
% Case where nounactionpatt is in st as a failure.
nounactionpatt(_) --> checkst(nounactionpatt, _, f, _), {!, fail}.

nounactionpatt(P) --> snoop(S0, S0), % S0 is the input string
    { \+ checkst(nounactionpatt, 1, _, _, S0, _) },
    actionlmod(L, Syn1),
    nounactionunit(A),
    actionrmod(R, Syn2),
    snoop(S, S),
    { ( Syn1 = ved, append(R, [A], RA),
        append(L, RA, P) ;
        Syn1 = ving, append(R, [A], RA),
        L = [action, Verb, Object],
        modlist(RA, Object, Mods),
        frame(P, action, Verb, Mods) ),
      addst(nounactionpatt, 1, s, P, S0, S) }.
% no more nounactionpatt - save failure
nounactionpatt(_) --> addst(nounactionpatt, 0, f, _), {!, fail}.

% the central unit of the nounactionpatt is a nounactpatt or a process 
nounactionunit (A) --> nounactpatt (A) . 
nounactionunit (A) --> process (A) . 

% left modifiers of nounactpatt
% Zvad-inhibited cleavage of Ps1
actionlmod(L, ved) --> substances(S),
    optdash,
    action(Sem, [ved], Target, Features),
    { frame(L, action, Target, [S]) }.

% apoptosis induced cleavage of ps2
actionlmod(L, ved) --> process(S),
    optdash,
    action(Sem, [ved], Target, Features),
    { frame(L, action, Target, [S]) }.

% apoptosis causing cleavage of Ps1 by Zvad.
% need to invert the order of nounactpatt and actionlmod
actionlmod(L, ving) --> processobject(A), % process or nounactpatt
    action(Sem, [ving], Target, Features),
    { frame(L, action, Target, A) }.

actionlmod([], _) --> [].

actionrmod(R, ved) --> action(Sem, [ved], Target, Features),
    byagent(A), % may have to add ving to actionrmod
    { frame(R, action, Sem, A) }.
actionrmod([], _) --> [].



% actpatt parses a simple action between substances expressed by an active verb
%
% actpatt is saved in a symbol table (st); check for success/failure 1st
% Case where actpatt is in st and has been successful
actpatt(F) --> checkst(actpatt, _, s, F).
% Case where actpatt is in st as a failure.
actpatt(_) --> checkst(actpatt, _, f, _), {!, fail}.

% actpatt 1: substance acts on substance
% PDK1 phosphorylates p70s6k at Thr229
actpatt(F) -->
    snoop(S0, S0), % S0 is the input string
    { \+ checkst(actpatt, 1, _, _, S0, _) },
    substances(A1),
    sem_whichrel, % opt 'that'
    action(Semclass, [vp,ved], Target, Features),
    prepopt, % added prepopt to allow action 'to' and 'with' substance
    substances(A2),
    siteinfo(Site),
    snoop(S, S),
    { ( member(def, Features),
        modlist([A1,A2,Site], Mods) ;
        member(rev, Features),
        modlist([A2,A1,Site], Mods) ),
      frame(F, action, Target, Mods),
      addst(actpatt, 1, s, F, S0, S)
    }.

% actpatt 2:
% Substance was bound by Substance
% Substance was associated to substance.
% F can give either first or second place to the second argument;
% a byagent gets first position; prepagent gets second.
% Phosphorylated Fyn was associated with Cbl.

actpatt(F) -->
    snoop(S0, S0), % S0 is the input string
    { \+ checkst(actpatt, 2, _, _, S0, _) },
    substances(A1),
    sem_beterm(_),
    action(Semclass, [ven], Target, Features),
    optbyorprepagent(Position, A2),
    snoop(S, S),
    { ( member(def, Features),
        ( Position = second, modlist([A1,A2,Site], Mods) ;
          Position = first, modlist([A2,A1,Site], Mods) ) ;
        member(rev, Features),
        ( Position = second, modlist([A2,A1,Site], Mods) ;
          Position = first, modlist([A1,A2,Site], Mods) ) ),
      frame(F, action, Target, Mods),
      addst(actpatt, 2, s, F, S0, S)
    }.

% no more actpatt - save failure
actpatt(_) --> addst(actpatt, 0, f, _), {!, fail}.



% nounactpatt parses a simple action between substances expressed by a nominal
% verb
%
% nounactpatt is saved in a symbol table (st); check for success/failure 1st
% Case where nounactpatt is in st and has been successful
nounactpatt(Fmt) --> checkst(nounactpatt, _, s, Fmt).
% Case where nounactpatt is in st as a failure.
nounactpatt(_) --> checkst(nounactpatt, _, f, _), {!, fail}.



% nounactpatt 1:
% Jnk phosphorylation of Bad
nounactpatt(F) -->
    snoop(S0, S0), % S0 is the input string
    { \+ checkst(nounactpatt, 1, _, _, S0, _) },
    substances(A1),
    {aminoacidtest(A1)},
    optdash,
    action(Semclass, [n], Target, Features),
    ofobject(A2),
    % siteinfo(Site),
    snoop(S, S),
    { ( member(def, Features),
        modlist([A1,A2,Site], Mods) ;
        member(rev, Features),
        modlist([A2,A1,Site], Mods) ),
      frame(F, action, Target, Mods),
      addst(nounactpatt, 1, s, F, S0, S)
    }.

% nounactpatt 2: the binding of substance and substance 
% association of Fyn and Cbl. 

% the reason for having this as a separate pattern is to 

% prevent 'Fyn and Cbl' from being parsed together as substances 

nounactpatt(F) -->
    snoop(S0, S0), % S0 is the input string
    { \+ checkst(nounactpatt, 2, _, _, S0, _) },
    action(attach, [ving,n], Target, Features),
    ofobjectl(A1),
    andobject(A2),
    % siteinfo(Site),
    snoop(S, S),
    { modlist([A1,A2,Site], Mods),
      frame(F, action, Target, Mods),
      addst(nounactpatt, 2, s, F, S0, S)
    }.

% nounactpatt 3:
% The cleavage of protein by substance.
% Association of phosphorylated Fyn with Cbl
% Tyrosine phosphorylation of Cbl by kinase
% optbyorprepagent determines the order of arguments; byagent is placed first,
% prepagent is placed second

nounactpatt(F) -->
    snoop(S0, S0), % S0 is the input string
    { \+ checkst(nounactpatt, 3, _, _, S0, _) },
    actionof(F),
    snoop(S, S),
    { addst(nounactpatt, 3, s, F, S0, S) }.

actionof(F) -->
    siteinfo(Site),
    action(Semclass, [ving,n], Target, Features),
    optofobject(A1),
    optbyorprepagent(Position, A2),
    snoop(S, S),
    { ( member(def, Features),
        ( Position = second, modlist([A1,A2,Site], Mods) ;
          Position = first, modlist([A2,A1,Site], Mods) ) ;
        member(rev, Features),
        ( Position = second, modlist([A2,A1,Site], Mods) ;
          Position = first, modlist([A1,A2,Site], Mods) ) ),
      frame(F, action, Target, Mods)
    }.

% nounactpatt 4:
% Fyn association with Cbl.

nounactpatt(F) -->
    snoop(S0, S0), % S0 is the input string
    { \+ checkst(nounactpatt, 4, _, _, S0, _) },
    substances(A1),
    action(Semclass, [ving,n], Target, Features),
    withobject(A2),
    % siteinfo(Site),
    snoop(S, S),
    { modlist([A1,A2,Site], Mods),
      frame(F, action, Target, Mods),
      addst(nounactpatt, 4, s, F, S0, S)
    }.

aminoacidtest(X) :- X \= [aminoacid|_].

% nounactpatt 5:
% IL-2 gene transcription
% Cbl phosphorylation [by substance or action]
nounactpatt(F) -->
    snoop(S0, S0), % S0 is the input string
    { \+ checkst(nounactpatt, 5, _, _, S0, _) },
    substances(A2),
    optdash,
    action(Semclass, [n], Target, Features),
    optbyagent(A1),
    % siteinfo(Site),
    snoop(S, S),
    { ( member(def, Features),
        modlist([A1,A2,Site], Mods) ;
        member(rev, Features),
        modlist([A2,A1,Site], Mods) ),
      frame(F, action, Target, Mods),
      addst(nounactpatt, 5, s, F, S0, S)
    }.

% nounactpatt 6:
% fyn-cbl association.

nounactpatt(F) -->
    snoop(S0, S0), % S0 is the input string
    { \+ checkst(nounactpatt, 6, _, _, S0, _) },
    substances(A1),
    optdash,
    substances(A2),
    action(Semclass, [n,ving], Target, Features),
    % siteinfo(Site),
    snoop(S, S),
    { modlist([A1,A2,Site], Mods),
      frame(F, action, Target, Mods),
      addst(nounactpatt, 6, s, F, S0, S)
    }.
% nounactpatt 7:
% Cbl phosphorylated by fyn.

nounactpatt(F) -->
    snoop(S0, S0), % S0 is the input string
    { \+ checkst(nounactpatt, 7, _, _, S0, _) },
    substances(A1),
    action(Semclass, [ven], Target, Features),
    [by],
    substances(A2),
    % siteinfo(Site),
    snoop(S, S),
    % { ( member(def, Features),
    { modlist([A2,A1,Site], Mods),
      % member(rev, Features),
      % modlist([A1,A2,Site], Mods) ),
      frame(F, action, Target, Mods),
      addst(nounactpatt, 7, s, F, S0, S)
    }.

% no more nounactpatt - save failure
nounactpatt(_) --> addst(nounactpatt, 0, f, _), {!, fail}.



connectact(Sem, Syn, Target, Features) -->
    action(Sem, Syn, Target, Features),
    {member(Sem, [cause, causel, activate, inactivate, signal, substitute, promote])}.

connectacts(Sem, Syn, Target, Features) -->
    connectact(Sem, Syn, Target, Features).

% aminoacid like tyrosine: ex.: tyrosine Cbl phosphorylation
% at position 201 Thr
siteinfo(S) --> aminoacid(A),
    {frame(S, site, [A], [])}.

siteinfo(S) -->
    sitepreps, % 'in', 'at'
    position(S).
siteinfo([]) --> [].
sitepreps --> prepterm(in, _).
sitepreps --> prepterm(at, _).
position(S) --> [position],
    sem_integerterm(I),
    { frame(S, site, I, []) }.



% The definitions of actions refer to the lexicons lexsynact.pl and lexsemact.pl 
% Sem is the semantic class; Syn is the syntactic class 
% F is the target 

% oneaction was added for use with moreaction to allow parsing of conjoined 
% actions 

oneaction(activate, Syn, F, Features) --> activateterm(Syn, F, Features), {!}.
oneaction(attach, Syn, F, Features) --> attachterm(Syn, F, Features), {!}.
oneaction(breakbond, Syn, F, Features) --> breakbondterm(Syn, F, Features), {!}.
oneaction(createbond, Syn, F, Features) --> createbondterm(Syn, F, Features), {!}.
oneaction(inactivate, Syn, F, Features) --> inactivateterm(Syn, F, Features), {!}.
oneaction(react, Syn, F, Features) --> reactterm(Syn, F, Features), {!}.
oneaction(release, Syn, F, Features) --> releaseterm(Syn, F, Features), {!}.
oneaction(signal, Syn, F, Features) --> signalterm(Syn, F, Features), {!}.
oneaction(substitute, Syn, F, Features) --> substituteterm(Syn, F, Features), {!}.
oneaction(transcribe, Syn, F, Features) --> transcribeterm(Syn, F, Features), {!}.
oneaction(promote, Syn, F, Features) --> promoteterm(Syn, F, Features), {!}.
oneaction(generate, Syn, F, Features) --> generateterm(Syn, F, Features), {!}.
oneaction(cause, Syn, F, Features) --> causeterm(Syn, F, Features), {!}.



action(activate, Syn, F, Features) --> activateterm(Syn, A1, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(attach, Syn, F, Features) --> attachterm(Syn, A1, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(breakbond, Syn, F, Features) --> breakbondterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(createbond, Syn, F, Features) --> createbondterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(inactivate, Syn, F, Features) --> inactivateterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(react, Syn, F, Features) --> reactterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(release, Syn, F, Features) --> releaseterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(signal, Syn, F, Features) --> signalterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(substitute, Syn, F, Features) --> substituteterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(transcribe, Syn, F, Features) --> transcribeterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(promote, Syn, F, Features) --> promoteterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(generate, Syn, F, Features) --> generateterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.
action(cause, Syn, F, Features) --> causeterm(Syn, F, Features),
    moreaction(Conj, Args),
    { Conj = [], F = A1 ;
      Conj \= [], mergemods([[action, A1]], Args, Actions),
      frame(F1, relation, Conj, Actions), F = [F1] }.

% binds, phosphorylates and activates
moreaction(Conj, Args) --> sem_conjrest(Conj1),
    oneaction(Sem, Syn, A, Features),
    moreaction(Conj2, Alist),
    { Conj2 = [], Alist = [], Conj = Conj1, Args = [[action, A]] ;
      Conj2 \= [], Conj = Conj2,
      addmod([action, A], Alist, Args) }.

moreaction([], [], S, S).



passiveconnect (Sem, [ven] , Target , Features) --> 
sem_beterm (_) , 

connectact (Sem, [ven] , Target , Features) . 



processpatt (A) --> disease (A) . 
processpatt (A) --> process (A) . 



optbyorprepagent(first, A) --> byagent(A).
optbyorprepagent(second, A) --> prepagent(A).
optbyorprepagent(first, A) --> [], {A = x}.

byorprepagent(first, A) --> byagent(A).
byorprepagent(second, A) --> prepagent(A).

optbyagent(A) --> byagent(A).
optbyagent(A) --> [], {A = [x]}.

byagent(A) --> [by],
    substances(A).
byagent(A) --> [by],
    nounactionpatt(A).
prepagent(A) --> withobject(A).
prepagent(A) --> toobject(A).
% prepagent(A) --> andobject(A).
prepagent(A) --> ofobject(A).
% optprepagent(A) --> byagent(A).
optprepagent(A) --> ofobject(A).
optprepagent(A) --> withobject(A).
optprepagent(A) --> toobject(A).
optprepagent(A) --> andobject(A).
optprepagent(A) --> [], {A = [x]}.



ofobject(A) --> [of],
    nounactionpatt(A).
ofobject(A) --> [of],
    substances(A).
ofobject(A) --> [of],
    actionof(A).
ofobjectl(A) --> [of], substance(A).
optofobject(A) --> ofobject(A).
optofobject([x]) --> [].



% to parse Binding of Fyn and Bad 



processobject (A) --> process (A) . % can be expanded to nounactpatt, etc 



% optwithobject(A) --> withobject(A).
% optwithobject(A) --> [], {A = [x]}.
withobject(A) --> [with], substances(A).
toobject(A) --> [to], substances(A).
andobject(A) --> [and], substances(A).
prepobject(A) --> [to], substances(A).
prepobject(A) --> [with], substances(A).



optbyarg(A) --> [by],
    actionarg(A).
optbyarg(A) --> substances(A).
optbyarg(A) --> [], {A = ['substance unknown']}.



prepopt --> [to].
prepopt --> [with].
prepopt --> [by].
prepopt --> [of].
prepopt --> [].



% toopt
toopt --> [to].
toopt --> [].
% withopt
withopt --> [with].
withopt --> [].

optdash --> ['-'].
optdash --> [].
optof --> [of].
optof --> [].

/* optactionarg(A) --> actionarg(A).
   optactionarg([]) --> []. */

optactionarg(A) -->
    actionarg(A).

% there is no further argument
optactionarg(A) -->
    [],
    {A = []}.

% substances(F) --> substance(F).
% substances(F) --> substance(P1),
%     moresubstances(Conj, Plist),
%     { Conj = [], Plist = [], F = P1 ;
%       Conj \= [],
%       mergemods(P1, Plist, Args),
%       frame(F, relation, Conj, Args)
%     }.
% substances(F) --> substanceswithmods(F).
% substances(A) -->
%     proteins(A).
% subswithmods.txt

% substances is saved in a symbol table (st);
% check for success/failure 1st
% Case where substances is in st and has been successful
substances(Fmt) --> checkst(substances, _, s, Fmt).
% Case where substances is in st as a failure.
substances(_) --> checkst(substances, _, f, _), {!, fail}.

substances(F) -->
    snoop(S0, S0),
    { \+ checkst(substances, 1, s, _, S0, _) },
    lmods(Lmods), % left modifiers
    ( severalsubstances([relation, Conj, First | Rest]), % conjoined substances
      rmods(Rmods), % right modifiers
      % create list of lists containing distributed mods of substances
      { distributesubs(Dist, [First | Rest], Lmods, Rmods),
        % check Lmods - "no" F1 or F2 should be changed to no F1 and no F2
        fixconj(Lmods, [rel,Conj], [rel,C2]),
        %splice([Conj,Dist],F)
        frame(F, relation, C2, Dist) } ;
      % substances and modifiers without conjunction
      substance(D1),
      rmods(Rmods),
      { D1 = [Type1, Substance1 | ModsD1],
        delete(ModsD1, [], ModsD2),
        append([Lmods, Rmods], ModsD2, Allmods1),
        delete(Allmods1, [], Allmods2),
        frame(F, Type1, Substance1, Allmods2) } ),
    snoop(S, S),
    { addst(substances, 1, s, F, S0, S) }.

/* substances(F) --> snoop(S0, S0),
       {\+ checkst(substances, 3, s, _, S0, _)},
       complex(F),
       {addst(substances, 3, s, F, S0, S)}.
*/

% no more substances - save failure
substances(_) --> addst(substances, 0, f, _), {!, fail}.






severalsubstances(F) -->
    substance(P1),
    moresubstances(Conj, Plist),
    { Conj = [], Plist = [], F = P1 ;
      Conj \= [],
      addmod(P1, Plist, Args),
      frame(F, relation, Conj, Args)
    }.



% X, Y, and Z
moresubstances(Conj, Args) --> sem_conjrest(Conj1),
    substance(P1),
    moresubstances(Conj2, Plist),
    { Conj2 = [], Plist = [], Conj = Conj1, Args = [P1] ;
      Conj2 \= [], Conj2 \= ',', Conj = Conj2,
      addmod(P1, Plist, Args)
    }.

% to allow for substances with modifiers
moresubstances(Conj1, Args) --> sem_conjrest(Conj1),
    substances(Args), {!}.

moresubstances([], []) --> []. % no conjunction



% distributesubs 

% distributes left mods and right mods over list of findings creating 
% list of lists of findings with mods 
distributesubs([], [], _, _).
distributesubs(Dist, [D1 | Tail], Lmods, Rmods) :-
    distributesubs(Dist2, Tail, Lmods, Rmods), % distributed for remainder
    D1 = [Type1, Substance1 | ModsD1],
    append([Lmods, Rmods], ModsD1, Allmods1),
    delete(Allmods1, [], Allmods2),
    frame(D, Type1, Substance1, Allmods2),
    append([D], Dist2, Dist). % combine findings to get list of findings


lmods(A) --> stateterm(F),
    {frame(A, state, F, [])}.
lmods([]) --> sem_measure(_).
lmods([]) --> [].
rmods([]) --> [].

stateterm(F) --> acclex(state, F).

% for past participle of createbond and breakbond actions, the target
% is the word. ex.: phosphorylated, dephosphorylated, methylated
stateterm(F) -->
    snoop(S0, S0), % get the initial string
    createbondterm([ven], _, _),
    {S0 = [F|_]}. % get the first word of the string

stateterm(F) -->
    snoop(S0, S0), % get the initial string
    breakbondterm([ven], _, _),
    {S0 = [F|_]}. % get the first word of the string
% may have to add attachterm for 'bound' 



% Taken from MedLEE grammar to handle '3 cm'
sem_measure(M) -->
    sem_premeasure,
    sem_quantityterm(N),
    optdash,
    sem_measureterm(Unit),
    { frame(M, measure, [N,Unit], []) }.
% complex predicates added November 8, 1999 
% CrkL-C3G complex 
% ras : raf-1 association 
% ras: raf-1 complexes 
% shc-grb2-sos 
% TCR/CD3 complex 

% p/CAF-p/CIP-CBP/p300-SRC-l complex 
% Ras: Raf-1 complexes 
complex(C) --> proteins(P),
    {P = [A,B|_], A \= [], B \= []},
    optcomplexword,
    { frame(C, complex, [P], []) }.

% a complex of NFAT4 with calcineurin 
complex (C) --> complexword, 

complexarg (A) , 

{frame (C, complex, [A] , [] ) } . 

complexarg (A) --> [of], proteins (A) . 

complexarg (A) --> [between], proteins (A) . 

% a complex between MyD88, IRAK- 2, and the IL-IRs 

complexarg (A) --> action (contain) , proteins (A) . 

% Complexes containing BOB.l/OBF.l and Oct proteins 

proteins(P) --> protein(A),
    moreproteins(P1),
    {(A \= [], append([A], P1, P))}.

moreproteins (A) --> proteinconnector, 

proteins (A) . 



moreproteins([]) --> [].
proteinconnector --> ['-'].
proteinconnector --> ['/'].
proteinconnector --> [':'].
% connector --> [','].   % taken out not to conflict with relation in
% connector --> [and].   % moresubstances
proteinconnector(C) --> [with].
optconnector --> proteinconnector.
optconnector --> [].



complexword --> [complex].
complexword --> [complexes].
complexword --> ['signaling complexes'].

optcomplexword --> complexword.
optcomplexword --> [].



substance(A) --> protein(A).
substance(A) --> cell(A).
substance(A) --> species(A).
substance(A) --> structure(A).
substance(A) --> domain(A).
substance(A) --> gene(A).
substance(A) --> geneorprotein(A).
substance(A) --> aminoacid(A).
substance(A) --> smallmolecule(A).
substance(A) --> matter(A).
substance(A) --> proteinsite(A).
substance(A) --> disease(A).
substance(A) --> complex(A). % this will be modified later



protein (A) --> 

proteinterm (P) , 

{frame (A, protein, P, [] ) } . 

complex (A) --> 

complexterm (P) , 

{frame (A, complex, P, [] ) } . 

cell (A) --> 

cellterm(P) , 

{ frame (A, cell , P, [] ) } . 

species (A) - - > 

speciesterm (P) , 

{frame (A, species, P, [] ) } . 

structure (A) --> 

structureterm(P) , 

{frame (A, structure, P, [] ) } . 

domain (A) --> 

domainterm(P) , 

{frame (A, domain, P, [] ) } . 

gene (A) - - > 

geneterm(P) , 

{frame (A, gene, P, [] ) } . 



geneorprotein (A) --> 
gpterm (P) , 
[X] , 

{ (X = gene, frame (A, gene, P, [] ) ; 

X = protein, frame (A, protein, P, [] ) ; 

X\= gene, X \= protein, frame (A, geneorprotein, P, []))}. 



aminoacid (A) --> 

aminoacidterm (P) , 

{frame (A, aminoacid, P, [] ) } . 

smallmolecule(A) -->
    smallmoleculeterm(P),
    {frame(A, 'small molecule', P, [])}.

matter(A) -->
    matterterm(P),
    {frame(A, substance, P, [])}.

proteinsite(A) -->
    proteinsiteterm(P),
    {frame(A, 'protein site', P, [])}.

disease (A) --> 

diseaseterm(P) , 
{frame (A, disease, P, [] ) } . 
process (A) --> 

processterm (Syn, F, Features) , 
{frame (A, process, F, []),!} . 
process (A) --> 

processterm (P) , 

{frame (A, process, P, []),!}. 



% terminals
proteinterm(F) --> acclex(protein, F).
complexterm(F) --> acclex(complex, F).
cellterm(F) --> acclex(cell, F).
speciesterm(F) --> acclex(species, F).
structureterm(F) --> acclex(structure, F).
domainterm(F) --> acclex(domain, F).
geneterm(F) --> acclex(gene, F).
gpterm(F) --> acclex(gp, F).
aminoacidterm(F) --> acclex(aminoacid, F).
smallmoleculeterm(F) --> acclex(smallmolecule, F).
matterterm(F) --> acclex(substance, F).
proteinsiteterm(F) --> acclex(proteinsite, F).
diseaseterm(F) --> acclex(disease, F).
processterm(F) --> acclex(process, F).



% action(activate, Syn, F, Features) --> activateterm(Syn, F, Features).

activateterm(Syn, F, Features) --> acclexss(activate, Syn, F, Features).
attachterm(Syn, F, Features) --> acclexss(attach, Syn, F, Features).
breakbondterm(Syn, F, Features) --> acclexss(breakbond, Syn, F, Features).
createbondterm(Syn, F, Features) --> acclexss(createbond, Syn, F, Features).
inactivateterm(Syn, F, Features) --> acclexss(inactivate, Syn, F, Features).
reactterm(Syn, F, Features) --> acclexss(react, Syn, F, Features).
releaseterm(Syn, F, Features) --> acclexss(release, Syn, F, Features).
signalterm(Syn, F, Features) --> acclexss(signal, Syn, F, Features).
substituteterm(Syn, F, Features) --> acclexss(substitute, Syn, F, Features).
transcribeterm(Syn, F, Features) --> acclexss(transcribe, Syn, F, Features).
promoteterm(Syn, F, Features) --> acclexss(promote, Syn, F, Features).
processterm(Syn, F, Features) --> acclexss(process, Syn, F, Features).
generateterm(Syn, F, Features) --> acclexss(generate, Syn, F, Features).
causeterm(Syn, F, Features) --> acclexss(cause, Syn, F, Features).



% Semlist contains a phrase which is an action
actionchk(Semlist) :-
    intersect(Semlist, [attach, cause, createbond, breakbond, activate,
        inactivate, substitute, transcribe, express, promote, signal]).

% Semlist contains a phrase which is a connector action
connectchk(Semlist) :-
    intersect(Semlist, [cause, activate, inactivate, substitute,
        promote, signal]).



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Genome section: ends here                                            %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% relations are connected by conjunctions, or
% certain 'conn' prepositions.

% Taken from MedLEE grammar to handle connectives that are conjunctions
% Ex: "severe markings, possibly from tuberculosis"

sem_relation(F, []) -->          % relation and modifiers
    sem_commapunc,
    sem_certainty([], C, rel),
    prepterm(P, conn),
    {frame(F, rel, P, C)}.

%plice( [[rel,P] ,C] ,R) . 

% Ex: "markings, swelling", "markings and swelling"

sem_relation(R, []) --> sem_conjrel(R),
    sem_commapunc.
% "density may represent known tumor"

% "markings, and swelling"
sem_conjrel(F) -->
    sem_commapunc,
    sem_conjterm(Conj),
    {frame(F, rel, Conj, [])}.

sem_conjrest(Conj) -->           % restricted conj, has not sem_relation_showopt
    sem_commapunc,
    sem_conjterm(Conj).
% "markings, swelling"
sem_conjrest(',') -->
    snoop(S0, S0),
    sem_commapunc,
    snoop(S, S),
    {S0 \= S}.

% Treatment of Verbs from MedLEE's Grammar

% form of "be"
sem_auxverb(B) --> sem_beterm(B).

% form of "do"
sem_auxverb(B) --> sem_doterm(B).

% form of "have"
sem_auxverb(B) --> sem_haveterm(B).

sem_recrel --> prepterm(in, _).
sem_recrel --> prepterm(to, _).
% "is not"
sem_auxrel(V) --> sem_auxverb(_),
    sem_negterm(V).
sem_auxrel(V) --> sem_auxverb(V).

% left modifiers of findings include negation, quantity, certainty, degree, and
% change type modifiers
sem_integer(W) --> [W], {integer(W)}.
sem_integer(W) --> integerterm(W).
sem_timeunit(T) --> sem_timeunitterm(T).

% From MedLEE grammar - "lasting 2 days", "for 2 days", "times 2 days"
sem_duration(F) -->
    sem_durpreps,
    sem_premeasure,              % about
    sem_timemeasure(T),
    sem_durationmod,             % opt. - "in duration"
    {frame(F, duration, [T], [])}.
sem_duration([], S, S).

sem_durpreps --> [times].
sem_durpreps -->
    prepterm(for, _).
sem_durpreps --> [lasting, for].
sem_durpreps --> [lasting].
sem_durpreps --> [lasted, for].
sem_durpreps --> [lasted].
sem_durationmod -->
    sem_aposts,                  % opt. - "'s"
    [duration].
sem_durationmod --> [in], [duration].
sem_durationmod --> [].
sem_aposts --> [''''], [s].
sem_aposts --> [].

% sem_frequency taken from MedLEE's grammar

% "two times", "times two", "two times a/per week", "two times daily"
sem_frequency(F) -->
    sem_freqterm(F1),            % "once"
    sem_freqterm(F2),            % "a day"
    {frame(M, unitval, [F1, F2], []),
     frame(F, frequency, [M], [])}.

sem_frequency(F) -->
    sem_freqterm(M),             % "qid", "daily"
    {frame(F, frequency, M, [])}.

% "2 times"
sem_frequency(F) -->
    sem_premeasure,
    sem_quantityterm(M),
    sem_times,
    {frame(F, frequency, [M], [])}.

% "times 2"
sem_frequency(Q) -->
    sem_times,
    sem_quantityterm(Q1),
    {frame(Q, frequency, Q1, [])}.
sem_frequency(F) -->
    [q], sem_quantityterm(Q),
    sem_timeunit(T),
    {frame(F, frequency, [unitval, [Q, T]], [])}.
sem_frequency(F) --> sem_eachevery,
    sem_quantityterm(Q),
    sem_timeunit(T),
    {frame(F, frequency, [unitval, [Q, T, every]], [])}.
sem_frequency(Q) -->             % "second"
    sem_ordinal(O),
    sem_timeopt,
    {frame(Q, frequency, O, [])}.
sem_frequency([], S, S).
sem_timeopt --> [time].
sem_timeopt --> [].
sem_eachevery --> [each].
sem_eachevery --> [every].
sem_times --> [times].
sem_times --> [x].
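
% The "2 times" and "times 2" rules can be tried in isolation with the sketch
% below. frame/4 and sem_quantityterm//1 are not defined in this part of the
% listing, so the versions here are assumed stand-ins: frame/4 is taken to
% build a list [Type, Value | Modifiers], and sem_quantityterm//1 to accept a
% literal integer token. sem_premeasure is left out for brevity.

```prolog
% Assumed stand-in for frame/4: frame(-Frame, +Type, +Value, +Modifiers).
frame([Type, Value | Mods], Type, Value, Mods).

% Assumed stand-in for sem_quantityterm//1: one integer token.
sem_quantityterm(Q) --> [Q], { integer(Q) }.

sem_times --> [times].
sem_times --> [x].

% "2 times"
sem_frequency(F) -->
    sem_quantityterm(M),
    sem_times,
    { frame(F, frequency, [M], []) }.

% "times 2"
sem_frequency(F) -->
    sem_times,
    sem_quantityterm(Q),
    { frame(F, frequency, Q, []) }.
```

% With these stand-ins, phrase(sem_frequency(F), [2, times]) binds F to
% [frequency, [2]], and phrase(sem_frequency(F), [x, 3]) binds F to
% [frequency, 3]. The difference in value shape ([M] versus Q1) is carried
% over unchanged from the listing.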



% Taken from MedLEE's grammar

% negation modifier - "no" as in "no cardiomegaly"
sem_negation(F) -->
    sem_negterm(N),
    {frame(F, neg, N, [])}.
% negation not present
sem_negation([], S0, S0).

% Taken from MedLEE's grammar

% quantity modifier - "two" as in "two masses"
sem_quantity(F) -->
    snoop(S0, S0),
    {\+ checkst(sem_dates, _, s, _, S0, _)},   % not a legitimate date
    sem_quantityterm(Q),
    sem_quantityrmod(_),         % "2 or 3", "2 to 3"
    {\+ next_wordunit(S0),       % rule out '2 mm'
     frame(F, quantity, Q, [])
    }.

sem_quantity([], S0, S0).



sem_commapunc([','|S], S).
sem_commapunc(S, S).
sem_conjterm(C)     --> acclex(conj, C).
sem_doterm(D)       --> acclex(vdo, D).
sem_endmark(['.'|S], S).
sem_endmark([';'|S], S).
sem_freqterm(F)     --> acclex(freq, F).
sem_haveterm(H)     --> acclex(vhave, H).
integerterm(I)      --> acclex(integer, I).
sem_measureterm(M)  --> acclex(unit, M).
sem_medterm(M)      --> acclex(med, M).
sem_negterm(N)      --> acclex(neg, N).
prepterm(P, C)      --> acclex(p, [P, C]).
sem_timeunitterm(T) --> acclex(timeunit, T).

% lex0g - adapted from MedLEE lexicon

%%%%%%%%%%%%%%%%%%% CLOSED WORD CATEGORY LEXICON %%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%% NEGATIONS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

:- unknown(_, fail).

:- multifile(wdef/3).

wdef(cannot, neg, no).
wdef(neither, neg, no).
wdef(never, neg, no).
wdef(no, neg, no).
wdef(non, neg, no).
wdef(none, neg, no).
wdef(not, neg, no).
wdef(nothing, neg, no).

%%%%%%%%%%%%%%%%%%%%% CONJUNCTIONS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

wdef('&', conj, and).
wdef('/', conj, or).
wdef('-', grammar, '-').
wdef('+', conj, and).
wdef(although, conj, and).
wdef(and, conj, and).
wdef(as, conj, and).
wdef(because, conj, and).
wdef(but, conj, and).
wdef(',', conj, ',').
wdef(except, conj, no).
%wdef(if, grammar, if).
wdef(minus, conj, no).
wdef(nor, conj, no).
wdef(or, conj, or).
wdef(that, grammar, that).
wdef(though, conj, and).
wdef(thru, conj, and).
wdef(verses, conj, or).
wdef(versus, conj, or).
wdef(vs, conj, or).
wdef(when, grammar, when).
wdef(where, grammar, where).
wdef(whereas, conj, and).
wdef(which, grammar, which).
wdef(while, conj, and).
wdef(who, grammar, who).
wdef(yet, conj, and).
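
% The grammar reaches these facts through acclex//2, which is not defined in
% this part of the listing. As a rough sketch, assuming acclex(Category, Target)
% consumes a single word and returns its canonical target via wdef/3 (an
% assumption for illustration only), the lookup behaves like this:

```prolog
% A few entries copied from the lexicon above.
wdef(cannot, neg, no).
wdef(never, neg, no).
wdef(versus, conj, or).
wdef(vs, conj, or).
wdef(whereas, conj, and).

% Assumed definition of acclex//2: consume one word of the given category
% and yield its canonical (target) form.
acclex(Category, Target) --> [Word], { wdef(Word, Category, Target) }.

sem_negterm(N)  --> acclex(neg, N).
sem_conjterm(C) --> acclex(conj, C).
```

% Here phrase(sem_negterm(N), [cannot]) gives N = no, and both "versus" and
% "vs" map to the same canonical conjunction or, which is the point of the
% target column: synonyms collapse onto one form before the frames are built.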

%%%%%%%%%%%%%%%%%%%%% PREPOSITIONS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%

wdef(above, ploc, above).
wdef(about, p, [approximately, nconn]).
wdef(about, ploc, about).
wdef(across, ploc, across).
wdef(abutting, ploc, near).
wdef(accompanies, p, [with, conn]).
wdef(accompanying, p, [with, conn]).
wdef(adjacent, ploc, adjacent).
wdef(adjacent, region, adjacent).
wdef(after, p, [after, conn]).
wdef(after, tprep, after).
wdef(along, p, [on, nconn]).
wdef(approximately, p, [approximately, nconn]).
wdef(around, p, [approximately, nconn]).
wdef(at, p, [at, nconn]).
wdef(atop, p, [on, nconn]).
wdef(before, ploc, before).
wdef(before, tprep, before).
wdef(behind, ploc, behind).
wdef(below, ploc, below).
wdef(between, ploc, between).
wdef(beyond, ploc, beyond).
wdef(by, ploc, near).
wdef(despite, p, [with, conn]).
wdef(during, p, [during, conn]).
wdef(during, tprep, during).
wdef(encasing, ploc, encasing).
wdef(extending, p, [in, nconn]).
wdef(following, p, [after, conn]).
wdef(following, tprep, after).
wdef(for, p, [for, nconn]).
wdef(from, p, [from, conn]).
wdef(in, p, [in, nconn]).
wdef(including, p, [with, conn]).
wdef(into, p, [in, nconn]).
wdef(involving, p, [of, nconn]).
wdef(next, tprep, next).
wdef(occupying, p, [in, nconn]).
wdef(on, p, [on, nconn]).
wdef(of, p, [of, nconn]).
wdef(over, ploc, over).
wdef(overlie, ploc, over).
wdef(overlied, ploc, over).
wdef(overlies, ploc, over).
wdef(overlying, ploc, over).
wdef(prior, tprep, before).
wdef(near, ploc, near).
wdef(radiating, ploc, radiating).
wdef(regarding, p, [about, nconn]).
wdef(roughly, grammar, roughly).   % 'roughly 6 mm'
wdef(since, p, [since, conn]).
wdef(since, status, subsequent).
wdef(through, p, [in, nconn]).
wdef(throughout, p, [in, nconn]).
wdef(to, p, [to, nconn]).
wdef(toward, p, [to, nconn]).
wdef(towards, p, [during, conn]).
wdef(under, ploc, below).
wdef(underneath, ploc, below).
wdef(until, tprep, until).
wdef(up, grammar, up).
wdef(upon, p, [on, nconn]).
wdef(via, p, [with, conn]).
wdef(with, p, [with, conn]).
wdef(within, p, [in, conn]).
wdef(without, p, [no, conn]).
% wdef(without, neg, no).

%%%%%%%%%%%%%%%%%%%%%%%%%% UNITS OF MEASURE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
wdef('%', unit, percent).
wdef(cc, unit, cc).
wdef(centimeter, unit, cm).
wdef(centimeters, unit, cm).
wdef(cm, unit, cm).
wdef(degrees, unit, degree).
wdef(gm, unit, gram).
wdef(gms, unit, gram).
wdef(gram, unit, gram).
wdef(grams, unit, gram).
wdef(kg, unit, kilogram).
wdef(kilo, unit, kilogram).
wdef(kilogram, unit, kilogram).
wdef(kilograms, unit, kilogram).
wdef(liter, unit, liter).
wdef(liters, unit, liter).
wdef(microgram, unit, microgram).
wdef(micrograms, unit, microgram).
wdef(milliliter, unit, ml).
wdef(milliliters, unit, ml).
wdef(milligram, unit, mg).
wdef(milligrams, unit, mg).
wdef(milliseconds, unit, millisecond).
wdef(millivolts, unit, millivolt).
wdef(ml, unit, ml).
wdef(millimeter, unit, mm).
wdef(millimeters, unit, mm).
wdef(mm, unit, mm).
wdef(ozs, unit, ounce).
wdef(percent, unit, percent).

%%%%%%%%%%%%%%%%%%%%%%%%% NUMBERS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

wdef(half, integer, 'one half').
wdef(semi, quantity, semi).
wdef(ii, integer, 2).
wdef(iii, integer, 3).
wdef(iv, integer, 4).
wdef(v, integer, 5).
wdef(vi, integer, 6).
wdef(vii, integer, 7).
wdef(viii, integer, 8).
wdef(ix, integer, 9).
wdef(xii, integer, 12).
wdef(xiii, integer, 13).
wdef(one, integer, 1).
wdef(two, integer, 2).
wdef(double, quantity, double).
wdef(three, integer, 3).
wdef(four, integer, 4).
wdef(quadruple, quantity, quadruple).
wdef(five, integer, 5).
wdef(six, integer, 6).
wdef(sixty, integer, 60).
wdef(seven, integer, 7).
wdef(eight, integer, 8).
wdef(nine, integer, 9).
wdef(ten, integer, 10).
wdef(eleven, integer, 11).
wdef(twelve, integer, 12).
wdef(thirteen, integer, 13).
wdef(fourteen, integer, 14).
wdef(fifteen, integer, 15).
wdef(sixteen, integer, 16).
wdef(seventeen, integer, 17).
wdef(eighteen, integer, 18).
wdef(nineteen, integer, 19).
wdef(twenty, integer, 20).
wdef(thirty, integer, 30).
wdef(forty, integer, 40).
wdef(fifty, integer, 50).
wdef(sixty, integer, 60).
wdef(seventy, integer, 70).
wdef(eighty, integer, 80).
wdef(ninety, integer, 90).
wdef(hundred, integer, 100).
wdef(thousand, integer, 1000).
wdef(million, integer, 1000000).
wdef(billion, integer, billion).
wdef(zero, integer, 0).
wdef(first, ointeger, 1).
wdef(second, ointeger, 2).
wdef(third, ointeger, 3).
wdef(fourth, ointeger, 4).
wdef(fifth, ointeger, 5).
wdef(sixth, ointeger, 6).
wdef(seventh, ointeger, 7).
wdef(eighth, ointeger, 8).
wdef(ninth, ointeger, 9).
wdef(tenth, ointeger, 10).
wdef(eleventh, ointeger, 11).
wdef(twelfth, ointeger, 12).
wdef(thirteenth, ointeger, 13).
wdef(fourteenth, ointeger, 14).
wdef(fifteenth, ointeger, 15).
wdef(sixteenth, ointeger, 16).
wdef(seventeenth, ointeger, 17).
wdef(eighteenth, ointeger, 18).
wdef(nineteenth, ointeger, 19).
wdef(triple, quantity, triple).
wdef(twentieth, ointeger, 20).
wdef(thirtieth, ointeger, 30).
wdef(single, quantity, 1).
wdef(solitary, quantity, 1).



wdef(frequency, grammar, frequency). */
%%%%%%%%%%%%%%%%%%%%%%%%% FREQUENCIES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
wdef(once, freq, 1).
wdef(times, grammar, x).
wdef(twice, freq, 2).

% lexicon with lex0g containing common English words adapted from lex0 of MedLEE

% lex1g from lex1 of MedLEE
% August 23, 1999
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% CAROL FRIEDMAN
% QUEENS COLLEGE, COLUMBIA UNIVERSITY
%
% Version 3.0 4-01-00
% Version 2.0 1-31-96
% Version 1.0 1-5-92
%
% SEMANTIC LEXICON FOR CLINICAL TEXT
%
% The lexicon consists of several files:
%   lex0g.pl: single word closed classes
%   lex1g.pl: single word - general modifier type words:
%
%   wdef(word, category, target).
%
%   word - is the name of the word being categorized;
%   category - is the semantic category for the word
%   target - is the canonical/standard form for the word
%
%   words which are synonyms should be assigned the same
%   canonical form.
%   multi-word phrases are categorized as follows:
%   phrase(word, category, phrase, target).
%
% Semantic Categories:
%
%   certainty - "possible"
%     canonical values limited to: moderate - for possible
%                                  high - for high possible
%                                  low - for low possible
%   conj - relational operators "and", "or", which connect one finding
%          to another finding
%   neg - negation "no"; "not"
%   quant - for quantitative information "many"
%
:- unknown(_, fail).

:- ensure_loaded([nsphrase, lex0g, lex1g, lexsemact, lexsyn, lexsub]).

% definitions kept from MedLEE lexicon - lex1.pl

wdef(be, vbe, 'high certainty').
wdef(been, vbe, 'high certainty').
wdef(being, vbe, 'high certainty').
wdef(was, vbe, 'high certainty').
wdef(is, vbe, 'high certainty').
wdef(were, vbe, 'high certainty').

/*
wdef(became, vcertainty, 'high certainty').
wdef(become, vcertainty, 'high certainty').
wdef(becomes, vcertainty, 'high certainty').
wdef(becoming, vcertainty, 'high certainty').

put in action lexicon
wdef(changed, change, change).
wdef(changes, change, change).
wdef(changing, change, change).
wdef(necessarily, certainty, 'high certainty').
wdef(necessary, vrecommend, recommended).
wdef(necessitate, vstatus, need).
wdef(necessitated, vstatus, need).
wdef(necessitating, vstatus, need).
wdef(necessitates, vstatus, need).
wdef(need, vstatus, need).
wdef(needed, vstatus, need).
wdef(needing, vstatus, need).
wdef(needs, vstatus, need).
*/

% file ml_parser.pl
:- multifile(phrase/5).
:- multifile(wdef/3).
:- unknown(_, fail).

% Load in program components - library components are part of Prolog
:- ensure_loaded([library(basics), library(not), library(lists),
       library(readin), library(strings), library(ctypes), library(readconst),
       library(date), library(listparts), library(sets),
       radrec, radpardb, useful, util, tagging, lexicon, gengram]).

%:- initialization run.

%run :- on_exception(Error, processrun, stop(Error)).
runtime_entry(start) :- processrun.
runtime_entry(abort) :- halt.

% process report

processrun :- process, halt.

%stop(Error) :-
%    told,
%    write(user_error, 'Error: '), write(user_error, Error), halt.

% get user supplied parameters and process report
process :-
    get_args(Mode, Infile, Outfile, Prb, Undefs, Protocol), !,
    (Examtype = [] ;             % must have a domain
     process(Infile, Outfile, Prb, Undefs)).

% open Infile (text input) and process
process(Infile, Outfile, Prb, Undefs) :-
    see(Infile), seen, see(Infile),
    on_exception(Error,
                 test_genome(Outfile, Prb, Undefs),
                 app_err0(_, Outfile, Error)),
    closefiles(Outfile, Prb, Undefs).
process(_, Outfile, _, _) :-
    app_err(_, Outfile, 'Program failed').

app_err0(_, Output, Error) :-
    tell(Output),
    write('<error>'),
    write(' Prolog Error occurred: '),
    app_err(_, Output, Error).
app_err1(_, Output, Error) :-
    tell(Output),
    write('<error>'),
    write(' Error in input: '),
    app_err(_, Output, Error).
app_err(_, Output, Error) :-
    tell(Output),
    write(Error), write('</error>'), nl.

closefiles(Outfile, Errfile, Unfile) :-
    tell(Outfile), told,
    (Errfile = [] ; tell(Errfile), told),
    (Unfile = [] ; tell(Unfile), told).

% Argument options - get user defined arguments

% -p ProbFile (otherwise default is problem messages are not written to file)
% -i Infile (if input is supplied by file and not standard input)
% -s Section (default is impression)
% -m Mode (default is relax; the three choices are strict, relax, skip)
% -o Outfile (if output should be file and not standard output)
% -? Provide list of default arguments
% -u Undefs (otherwise default is - undefined messages are not written
%            to a file)

get_args(Mode, Infile, Outfile, Prbfile, Undefs, Protocol) :-
    unix(args(Args)),
    (Args = [], !, writesyntax;
     Args = ['?'], !, writesyntax;
     Args = [X|Rest], !,
     set_args([X|Rest], Mode, Infile, Outfile, Prbfile, Undefs, Protocol)).
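
% set_args/7 is called above but is not part of this section. The sketch below
% only illustrates the general idea of turning a flat [Flag, Value, ...] list
% into named options with defaults. parse_flags/2, flag_option/2 and
% option_value/4 are hypothetical names introduced for the example; they are
% not the listing's own predicates.

```prolog
:- use_module(library(lists)).

% Hypothetical mapping from command-line flag to option name.
flag_option('-m', mode).
flag_option('-i', infile).
flag_option('-o', outfile).
flag_option('-p', probfile).
flag_option('-u', undefs).

% parse_flags(+Args, -Options): pair each known flag with the value after it.
parse_flags([], []).
parse_flags([Flag, Value | Rest], [Opt | Opts]) :-
    flag_option(Flag, Name),
    Opt =.. [Name, Value],
    parse_flags(Rest, Opts).

% option_value(+Name, +Options, +Default, -Value): look up with a fallback.
option_value(Name, Opts, _, Value) :-
    Opt =.. [Name, Value],
    memberchk(Opt, Opts),
    !.
option_value(_, _, Default, Default).
```

% For example, parse_flags(['-m', strict, '-i', 'in.txt'], Opts) yields
% Opts = [mode(strict), infile('in.txt')], and option_value(mode, [], relax, M)
% falls back to M = relax, the default mode named in the comments above.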

writesyntax :-
    write(user_error, 'geneparser [-m Mode] '),
    nl(user_error),
    write(user_error, ' [-t Outtype] [-p Probfile] [-u Undefs] '),
    nl(user_error),
    write(user_error, ' [-i Infile] [-o Outfile]'),
    nl(user_error).

% nsphrase.pl - contains words/phrases that are ignored

nosem(both, [both]).
nosem(however, [however]).
nosem(selectively, [selectively]).
nosem(specifically, [specifically]).
nosem(the, [the]).
nosem(a, [a]).

% file radpardb.pl

% June 25, 1999

% fail an unknown predicate

:- unknown(_, fail).
:- op(900, fy, [not, once]).     % same priority and type as \+
:- op(700, xfx, % same priority and type as = or ==

:- dynamic(sentno/1).
% \sem\ radpardb.pl

%parse_sentences(+Beg, -Fmt, -ParseErrors, -Undefineds, -Unsents, +Section,
%                +UserMode, +Examtype, Sentno, Outsno, IncSno)
% Beg is list of sentences, Fmt is list of target forms,
% ParseErrors are a list of sentences which could not parse,
% Undefineds is a list of undefined words in sentence
% Unsents is a list of sentences containing undefined words
% Section is the section of the examination, UserMode is the
%   parsing mode specified by user,
% Examtype is the domain (type of exam)
% Sentno is the number of the starting sentence
% Outsno is the last sentence number + 1
% IncSno is the amount that the sentence number should be increased
%   (i.e. it is 1 when called by parse_sects and 0 when in
%   recovery mode)
% Each sentence is parsed independently.

parse_sentences([], [], [], [], [], _, _, _, _, _, _) :- !.   % no more sentences
parse_sentences(Beg, Fmtlist, Outfail, Outundefs, OutunSents,
                Section, UserMode, Examtype, _, _, IncSno) :-
    get_sentence(Beg, S, Rest), !,
    ( isidentifier(S), !,        % ignore identifier sentences - parse remainder
      parse_sentences(Rest, Fmt1, Outfail, Outundefs, OutunSents,
                      Section, UserMode, Examtype, _, _, IncSno), !,
      ( outputform(htext), S \= ['.'], !, IncSno \= 0,   % 0 means in recovery mode
        append([[[sentence, S]]], Fmt1, Fmtlist)
      ; Fmtlist = Fmt1
      )
    ;
      %( IncSno = 0, !;          % on same sentence in recovery mode
      %  sentno(Sno), NewSentno is Sno + IncSno,
      %  retract(sentno(_)), assert(sentno(NewSentno))
      %),
      % Incsno = 1, write('***'), write_list(S, 3, _), nl, !,
      % Incsno = 0,
      preprocess(S, Bs, Undef, Semlist, strict),   % bracket and check for undefineds
      parse_modes(S, Bs, Semlist, Fmt1, Errors, Undef, Unsents, Section, Writefail,
                  Examtype, UserMode, IncSno),     % parse first sentence
      parse_sentences(Rest, Fmt2, Moreerrors, Moreundefs, MoreUnSents,
                      Section, UserMode, Examtype, _, _, IncSno),   % parse remaining
      append(Errors, Moreerrors, Outfail),         % Combine failures
      ( outputform(htext),
        ( Fmt1 \= [], IncSno \= 0,
          !, append([Fmt1], Fmt2, Fmtlist)         % add extra bracket for 1st
        ; Fmt2 = [], Fmtlist = Fmt1, !             % in recovery mode
        )
      ; append(Fmt1, Fmt2, Fmtlist)
      ),                                           % Combine targets
      append(Unsents, MoreUnSents, OutunSents),    % Combine sentences
      append(Undef, Moreundefs, Outundefs)         % Combine undefined words
    ).

%parse_modes(+S, +Bs, +Semlist, -Fmt, -Failures, +Undef, -Unsents, +Section,
%            +WriteMessage, +Examtype, +Mode, +IncSno)
% S is original sentence; Bs is sentence after lexical lookup
% Semlist is list of semantic categories in sentence
% Fmt is formatted output,
% Failures is list of sentences/fragments which could not be parsed.
% Undef are words not in lexicon, Unsents are sentences containing
%   undefined words
% Section is name of section being processed
% WriteMessage is message returned from doresult (in case doresult fails)
% Examtype is domain, Mode is user specified mode
% IncSno is 0 if this is a fragment of a sentence that was already
%   parsed - but unsuccessfully; is 1 if this is a new sentence

% Best possible - try to get the most accurate parse possible trying
% all alternative strategies in turn if necessary
% All words in sentence are defined

parse_modes(S, Bs, Semlist, Fmt, Errors, [], [], Section, no, Examtype, Pmode,
            Inc) :-
    ( Pmode = bpseg, Pmodemod = mode2
    ; Pmode = bpseg2, Pmodemod = mode2
    ; Pmode = bpseg3, Pmodemod = mode2
    ; Pmode = bpskip, Pmodemod = mode4
    ; % in user specified parse mode - don't parse in mode 5 or keyword
      Pmode \= keyword, Pmode \= mode5,
      Pmodemod = mode1
    ),
    dosent(S, Bs, Semlist, Fmt1, Message, Section, _, Examtype, Pmodemod, _), !,   % strict first
    recovery(_, S, Bs, Semlist, Fmt2, Message, Errors, [], [], Section,
             Pmode, Examtype, _),    % try alternative modes if necessary
    ( outputform(htext), Inc \= 0, !, append([[[sentence, S]], Fmt1, Fmt2], Fmt)
    ; append(Fmt1, Fmt2, Fmt)
    ).

% alternative strategies if have undefined words

parse_modes(S, Bs, Semlist, Fmt, Errors, Undef, Unsents, Section, no, Examtype,
            Pmode, Inc) :-
    Undef \= [],
    recovery(_, S, Bs, Semlist, Fmt1, yes, Errors, Undef, Unsents, Section,
             Pmode, Examtype, _),    % try alternatives if have undefineds
    ( outputform(htext), Inc \= 0, !, append([[sentence, S]], Fmt1, Fmt)
    ; Fmt = Fmt1
    ).

% key word strategy is fastest but least reliable;

parse_modes(S, Bs, Semlist, Fmt, Errors, Undef, Unsents, Section, no, Examtype,
            Pmode, Inc) :-
    ( Pmode = keyword ; Pmode = mode5
    ; Pmode = mode5 ),
    recovery(5, S, S, Semlist, Fmt1, yes, Errors, Undef, Unsents, Section, Pmode,
             Examtype, _),
    ( outputform(htext), Inc \= 0, !, append([[sentence, S]], Fmt1, Fmt)
    ; Fmt1 = Fmt
    ).

% Parsing/Recovery modes

% recovery(+Level, +S, +Bs, +Sem, -Fmt, +Failed, -Errors, +Undef, +Unsents, +Section,
%          +Pmode, +Examtype, _)
% Level is the recovery level of the predicate
% S is the original sentence list
% Bs is the sentence after lexical lookup
% Sem is the list of semantic categories in the sentence
% Fmt is the formatted output for the sentence
% Failed is 'yes' if the parse was unsuccessful, and 'no' otherwise
% Undef is a list of words in sentence which are undefined (not in lexicon)
% Unsents are the lists of sentences/segments which could not be parsed.
% Section is the section of the report
% Pmode is the user specified parse mode
% Examtype is the domain

% mode 1 is the strictest parsing mode - the parser succeeded for the complete
%   original sentence using the grammar; all words in original sentence
%   are defined in lexicon

% mode 1 - alternative not needed because parse succeeded
recovery(1, _, _, _, [], no, [], Undef, Unsents, _, _, _, _) :- !.

% - no alternative strategy allowed in mode 1
% in case where there are no undefineds, Noparse is S
recovery(1, S, _, _, [], yes, S, [], [], _, Pmode, _, _) :-
    Pmode = strict ; Pmode = mode1, !.
% in case there are undefineds, Unsents is S
recovery(1, S, _, _, [], yes, Noparse, Undef, Unsents, _, Pmode, _, _) :-
    (Pmode = strict ; Pmode = 'mode1'),
    Undef \= [], Unsents = S, Noparse = [], !.
recovery(1, S, _, Semlist, [], yes, S, _, _, _, _, _, _) :-
    % sentence contains no relev. information, don't try to recover
    % \+ (subtype(finding, Semlist) ; subtype(time, Semlist)), !.
    \+ actionchk(Semlist).       % april 23, restored

% mode 4 - skip undefined words and try to parse according to mode 1
recovery(4, S, _, _, Fmt, yes, Errors, Undef, [], Sect, Pmode, Examtype, _) :-
    Undef \= [],
    ( Pmode = bp ; Pmode = mode4 ;
      Pmode = bpseg ; Pmode = bpskip ; Pmode = mode4
    ),
    preprocess(S, Bs, _, Semlist, bpskip),
    dosent(S, Bs, Semlist, Fmt1, Message, Sect, _, Examtype, mode4, _), !,
    recovery(_, Bs, Bs, Semlist, Fmt2, Message, Errors, [], [], Sect,
             bpskip, Examtype, Sentno),   % try alternatives if necessary
    append(Fmt1, Fmt2, Fmt).

% mode 3 - try longest parsed segment; partition rest of
% sentence using mode 5 for parse mode bp

recovery(3, S, Bs, _, Fmt, yes, Errors, Undef, Unsents, Sect, Pmode, Examtype, _) :-
    % allowable modes for choosing longest segment
    ( Pmode = bp ; Pmode = bpskip ;
      Pmode = skip ; Pmode = mode3 ; Pmode = mode4 ;
      Pmode = bpseg3 ; Pmode = bpseg
    ),
    ( Pmode = bpskip, Pmodemod = mode4_3 ;
      Pmodemod = mode3
    ),
    checkst(sem_pattern, _, s, Target, Bs, Rest),   % check symbol table
    %doresult(Target, Fmt1, Examtype, Sect, Pmodemod, _),
    formatresult(Target, Pmodemod, Fmt1),
    ( Pmode = mode5, Fmtlist = [], Errors = Rest ;
      recovery(5, Rest, Rest, _, Fmtlist, yes, Errors, Undef, Unsents, Sect,
               Pmode, Examtype, _)
    ),
    append(Fmt1, Fmtlist, Fmt).
% mode 2 segments sentence using word barrier methods. This mode is tried if
% parse failed for original sentence or there are undefined words

% segment sentence using word barriers

recovery(2, S, _, _, Fmt, yes, Errors, Undef, Unsents, Sect, Pmode, Examtype, _) :-
    ( Pmode = bp ; Pmode = bpskip ; Pmode = mode2 ; Pmode = skip ;
      Pmode = mode2 ; Pmode = mode3 ; Pmode = mode4 ;
      Pmode = bpseg ; Pmode = bpseg2 ;
      Pmode = bpseg3
    ),
    segmentandparse(S, Fmt, Errors, Unsents, Sect, Pmode, Examtype, _), !.
% mode 5 - try to partition sentences by findings
%   when a finding in sentence is found, go left until first
%   modifier is found (if 2 findings are next to each other, 2nd one
%   is considered the finding and 1st is considered the modifier)
%   Repeat searching for successive findings using this method
recovery(5, [], [], _, [], _, [], _, _, _, _, _, _) :- !.
recovery(5, S, Bs, _, Fmt, yes, Errors, Undef, Unsents, Sect,
         Pmode, Examtype, _) :-
    ( Pmode = bp ; Pmode = bpskip ; Pmode = bpseg ; Pmode = keymode ;
      Pmode = mode5 ; Pmode = negmode
    ),
    preprocess(S, Bs1, _, _, bpskip),            % skip undefined words
    actionfindingseg(Bs1, Fseg, Before), !,      % get segment containing finding
    ( Fseg = [], Errors = S, !                   % no finding to segment
    ; %Before = [], Errors = Bs, Fmt1 = [], !;   % this part was tried
      preprocess(Fseg, Bseg, _, Semlist, bpskip),
      dosent(Fseg, Bseg, Semlist, Fmt1, Message, Sect, _, Examtype,
             mode5, _),                          % try to parse finding segment
      ( Before = [], Before1 = [], Message = yes, !   % no segmenting yet - skip beg.
      ; Message = yes, Before1 = Before, !       % don't add '.'; have to skip more
      ; append(Before, ['.'], Before1)
      ),
      ( Fseg = [], Fmt = [], !                   % no finding left in sent. - don't recover
      ; recoverrest(Fseg, _, Before1, Fmt2, Message, Errors,
                    Sect, Newmode, Examtype, _), % recover remainder
        append(Fmt1, Fmt2, Fmt)
      )
    ).

% nothing could be recovered; all input -> Errors; Format is []
recovery(_, Sents, _, _, [], yes, Sents, Undef, [], _, _, _, _).
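
% The overall strategy of the recovery clauses (try the strictest mode first,
% then progressively more permissive ones, and finally report the whole input
% as unparsed) can be illustrated with a much smaller toy. Everything in the
% sketch is hypothetical: known/1, try_mode/3 and parse_with_recovery/3 are
% stand-ins invented for the example, and only two of the five modes are
% mimicked.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% Toy lexicon standing in for the real one.
known(il2).
known(activates).
known(stat5).

% try_mode(+Mode, +Words, -Fmt)
% mode1: strict, every word must be in the lexicon.
try_mode(mode1, Words, [parsed(Words)]) :-
    forall(member(W, Words), known(W)).
% mode4: skip undefined words, then accept what is left if non-empty.
try_mode(mode4, Words, [parsed(Known)]) :-
    include(known, Words, Known),
    Known \== [].

% parse_with_recovery(+Words, -Mode, -Fmt): the first mode that succeeds wins;
% if none does, nothing is recovered and the format is [].
parse_with_recovery(Words, Mode, Fmt) :-
    member(Mode, [mode1, mode4]),
    try_mode(Mode, Words, Fmt),
    !.
parse_with_recovery(_, failed, []).
```

% In this toy, [il2, activates, stat5] is handled by mode1, while
% [il2, strongly, activates, stat5] falls through to mode4, which drops the
% undefined word "strongly" before accepting the rest.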

% part of phrase was skipped, add period and treat skipped part as a
% sentence

% recoverrest(+Segment, +Semlist, +Before, -Fmt, +Message, -Failures, +Section,
%             +Mode, +Examtype, _)
% Segment is part of sentence with a finding
% Semlist is a list of semantic categories for that sentence part
% Before is the part of sentence before Segment
% Fmt is the format for this segment
% Message is 'no' if there is no semantic information to be recovered
% Message is 'yes' otherwise
% Failures are lists of segment(s) that could not be parsed successfully
% Section is section being processed, Mode is user specified parsing mode
% Examtype is domain

recoverrest(_, _, Before, [], no, Before1, _, _, _, _) :-
    ( Before = [], Before1 = [], !       % nothing was skipped
    ; append(Before, ['.'], Before1)
    ), !.

% nothing left to recover; write phrase that was skipped
recoverrest([], _, Before, [], yes, Before1, _, _, _, _) :-
    ( Before = [], Before1 = [], !
    ; append(Before, ['.'], Before1)
    ), !.

% can recover partial parse
recoverrest(Bs, _, Before, Fmt, yes, Errors, Sect, Pmode, Examtype, _) :-
    checkst(sem_pattern, _, s, Target, Bs, Restseg),   % recover from symbol tab.
    %doresult(Target, Fmt1, Examtype, Sect, mode5, _),
    formatresult(Target, mode5, Fmt1),
    recovery(5, Restseg, Rest, _, Fmt2, yes, Error2,
             [], [], Sect, Pmode, Examtype, _),
    append(Fmt1, Fmt2, Fmt),
    ( Before = [], Errors = Error2, !    % nothing skipped to add '.' to
    ; append(Before, ['.'|Error2], Errors)
    ).

% cannot recover partial parse - skip first element and retry
% if 1st element is a negation semantic type, skip 2nd element instead
% Handles case where 1st element is a negation, certainty or status
% add 2nd element to unparsed sentences list (enclosed in angle brackets).

recoverrest([X, Y|Restseg], _, Before1, Fmt, yes, Errors,
            Sect, Pmode, Examtype, _) :-
    foundword(X, Sem1, Tar),
    ( member(Sem1, [neg, certainty, vcertainty, vconn, status, vstatus])
    ; Sem1 = p, Tar = [_, conn]
    ),
    %(Mod = neg; Mod = certainty; Mod = status; Mod = vcertainty), % leave this mod in
    preprocess([X|Restseg], Fseg0, _, _, bpskip),      % skip undefined words
    findingseg(Fseg0, Fseg, Before2), !,               % get finding seg
    ( Fseg = [], Errors = [X, Y|Restseg], Fmt = []     % no finding
    ; preprocess(Fseg, Bseg, _, Restsem, bpskip),      % skip undefined words
      dosent(Fseg, Bseg, Restsem, Fmt1, Message, Sect, _, Examtype,
             mode5, _),                                % try to parse finding segment
      recoverrest(Fseg, _, [Y|Before2], Fmt2, Message, Error2,
                  Sect, negmode, Examtype, _),         % recover remainder
      ( Before1 = [], Errors = Error2, !
      ; append(Before1, ['.'|Error2], Errors)
      ),
      append(Fmt1, Fmt2, Fmt)
    ).

% skip 1st element; enclose it in brackets
recoverrest([X|Restseg], _, Before1, Fmt, yes, Errors,
            Sect, Pmode, Examtype, _) :-
    preprocess(Restseg, Fseg0, _, _, bpskip),
    findingseg(Fseg0, Fseg, Before2), !,               % get finding seg
    append(Before1, [X|Before2], Before),
    ( Fseg = [], Errors = [X|Restseg], Fmt = []        % no finding
    ; preprocess(Fseg, Bseg, _, Restsem, bpskip),
      dosent(Fseg, Bseg, Restsem, Fmt1, Message, Sect, _, Examtype,
             mode5, _),                                % try to parse finding segment
      recoverrest(Fseg, _, Before, Fmt2, Message, Errors,
                  Sect, Newmode, Examtype, _),         % recover remainder
      append(Fmt1, Fmt2, Fmt)
    ).

% no semantic information left; return Errors
recoverrest([X|Restseg], [], Before1, Fmt, yes, [X|Restseg],
            Sect, Pmode, Examtype, _).

% dosent(+S, +Bs, +Semlist, -Fmtlist, +Message, +Section, +WriteMessage, +Examtype,
%        +Mode).
% S is original list of words in sentence; Bs is list after lexical lookup
% Semlist is list of semantic categories corresponding to Bs
% Fmtlist is list of target forms for sentence
% Message is 'yes' if the output from parser signals a failure,
%   and 'no' otherwise
% Section is section of examination being processed
% WriteMessage signals whether an error occurred in generating target form
% Examtype is the domain, and Mode is the user specified mode of parsing
% Parses sentence and returns target in nested format

% Handles case where sentence should be skipped because info is about
% family member or peripheral to patient
dosent(S, _, Semlist, [], Error, _, _, _, _, _) :-
    skipsentence(S, Semlist, Error), !.
dosent(S, Bs, Semlist, Fmtlist, Errormsg, Section, Writefail, Examtype, Mode, _) :-
    attemptparse(P, Bs, sentence, Semlist, Section, Atotal),
    (   P = [failure], Errormsg = yes, Writefail = no, !           % parse failure
    ;   P = [], Errormsg = no, Writefail = no, Fmtlist = [], !     % empty target
    ;   %doresult(P, Fmtlist, Examtype, Section, Mode, _),
        formatresult(P, Mode, Fmtlist),
        Errormsg = no, Writefail = no, !
    ;   Errormsg = yes, Writefail = yes, !
    ).

%parse_sentences(Beg, Beg, [], [], _, _, _) :- !.

% attemptparse(-P, +Bs, +Structure, +Semlist, -Ftype, -Total)
% P is output from parser
% Bs is list of words in sentence after lexical lookup
% Structure is name of structure to be parsed
% Semlist is list of semantic categories corresponding to elements in Bs
% Total is number of times parser reached sem_sent in grammar;
%   where sem_sent is highest level predicate in grammar

% don't parse if sentence consists of only '.' or ';'
attemptparse([], Bs, _, _, _, _) :-
    Bs = ['.'] ; Bs = [';'].



% if a template exists for whole sentence, get parse from it
attemptparse(P, Bs, sentence, _, _, _) :-
    Bs = [X, '.'], is_list(X),     % the whole sentence is a finding
    find_sem_sent(P, X), !.

% parses and retracts wellformed string table - parses sentence
attemptparse(P, Bs, sentence, Semlist, Ftype, Atotal) :-
    retractall(wfst(_, _, _, _, _, _)),
    retractall(addstotal(_)),
    sem_sent(P, Semlist, Atotal, Bs, []), !.

% parses and retracts wellformed string table - parses bodypart only
attemptparse(P, Bs, bodypart, _, _, _) :-
    sem_bodyloc(P, Bs, []),
    retractall(wfst(_, _, _, _, _, _)), !.

% segmentandparse(+Sentences, -Fmtlist, -Failures, -Unsent, +Section, +Mode,
%                 +Examtype, +Sentno)
% Sentences is list of sentence segments.
% Fmtlist consists of the formatted output for the segments
% Failures is the list of unparsed segments.
% Unsent is the list of segments with undefined words.
% Section is the section being processed, Mode is the user specified mode
% Examtype is the domain and Sentno is the sentence id.

segmentandparse([], [], [], [], _, _, _, _) :- !.

segmentandparse(Sentences, Fmtlist, Failures, UnSent, Section, Mode,
                Examtype, Sentno) :-
    get_sentence(Sentences, S, Rest), !,       % sentence to segment
    preprocess(S, S1, _, Semlist, Mode),
    (   Mode = mode2, NewPmode = bpseg2
    ;   Mode = mode3, NewPmode = bpseg3
    ;   NewPmode = bpseg
    ),
    (   segment1(S1, Segs, [], seg), !,
        parse_sentences(Segs, Fmt1, Fails, Un1, Section, NewPmode, Examtype,
                        Sentno, Sentno, 0), !
    ;   segment2(S1, Segs, [], seg), !,
        parse_sentences(Segs, Fmt1, Fails, Un1, Section, NewPmode, Examtype,
                        Sentno, Sentno, 0), !
    ;   segment3(S1, Segs, [], Negstatus, seg), !,
        parse_sentences(Segs, Fmt1, Fails, Un1, Section, NewPmode, Examtype,
                        Sentno, Sentno, 0), !
    ),   % fails if cannot segment sentence; otherwise segments remainder
    segmentandparse(Rest, Fmt2, Nexterrors, NextUns, Section, Mode,
                    Examtype, Sentno),
    append(Fmt1, Fmt2, Fmtlist),
    append(Un1, NextUns, UnSent),
    append(Fails, Nexterrors, Failures), !.

% segment1(+S, -Segs, +Beg, +Message)
% S is list of words in sentence
% Segs consists of sentence segments as separate sentences
% Beg is list of words in sentence prior to the current portion of sentence
% Message is 'seg' if segmenting succeeded and 'noseg' otherwise

segment1([], [], _, noseg) :- !.

% segment sentence at connect phrase/word or at most conjunctions
% if negation precedes, restore negation
segment1([X|Rest], ['.', '<eos>'|Rem], Beg, seg) :-
    \+ sem_endmark(Rest, []),          % don't segment if at end already
    foundword(X, Sem, Target),         % get semantic classification and target
    (   X = nor, append([no], Rest, Rem)         % ok to segment at nor
    ;   X = without, append([no], Rest, Rem)     % ok to segment at without
    %;  X = ':', Rest = Rem
    ;   Sem = neg, Rest = [Next|Rest2],          % have negation; test word after
        foundword(Next, Sem2, Target2),          % for connective - add back negation
        testforconn(Next, Sem2, Target2), Rem = [X|Rest2]
    ;   testforconn(X, Sem, Target), Rest = Rem
    ).

segment1([X|Rest], [X|Newrest], Start, Seg) :-
    append(Start, [X], Beg),           % part before segmentation
    segment1(Rest, Newrest, Beg, Seg).

testforconn(X, Sem, Target) :-
    (   Sem = p, Target = [P, conn], P \= with   % segment at connective prep
    ;   member(Sem, [vconn, vshow])              % segment at these types of verbs
    ;   Sem = conj, \+ member(X, [and, or, ',', as])
    ).

% segment at certain words -
segment2([], [], [], noseg) :- !.

segment2(S, Segs, [], seg) :-
    seg2(S, Rest, Segs),
    \+ sem_endmark(Rest, []), !.
segment2([X|Rest], [X|Newrest], [], Seg) :-
    segment2(Rest, Newrest, [], Seg).
seg2([X|Rest], Rest, ['.', '<eos>'|Rem]) :-
    member(X, [which, that, until, where, when, while, who,
               '(', ')', between, whereby, after, before, prior,
               greater, ranging]),
    Rem = Rest, !.
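
% Illustrative sketch (not part of the original listing): the idea behind
% segmenting at boundary words, reduced to a self-contained toy. Unlike
% seg2/3 above, which marks the break with '.' and '<eos>' tokens, this
% version returns a list of segments. demo_boundary/1 and demo_split/2 are
% hypothetical names, and the boundary words are a small assumed subset.
demo_boundary(which).
demo_boundary(that).
demo_boundary(while).

demo_split([], [[]]).
demo_split([W|Ws], [[]|Segs]) :-
    demo_boundary(W), !,            % drop the boundary word, start a new segment
    demo_split(Ws, Segs).
demo_split([W|Ws], [[W|Seg]|Segs]) :-
    demo_split(Ws, [Seg|Segs]).     % keep W in the current segment

% ?- demo_split([mass, which, is, stable], Segs).
% Segs = [[mass], [is, stable]].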

segment3([], [], _, _, noseg) :- !.

% segment at conjunction - if negation preceded conjunction, add
segment3([X|Rest], Rem, Beg, Negstatus, seg) :-
    \+ sem_endmark(Rest, []), !,       % already at end of sentence
    seg3([X|Rest], Rem, Beg, Negstatus, seg), !.

seg3([X|Rest], Rem, Beg, Negstatus, seg) :-
    wdef(X, conj, _),
    member(X, [and, or, ',']),
    (   nonvar(Negstatus), Rem = ['.', Negstatus|Rest], !   % restore negation
    ;   Rem = ['.', '<eos>'|Rest], !
    ).

seg3([X|Rest], [X, '.', '<eos>'|Rest], _, _, seg) :-
    foundword(X, age), !.

seg3([X|Rest], [X|Newrest], Start, Negstatus, Seg) :-
    (   nonvar(Negstatus), !       % 1st neg already found - continue segmenting
    ;   foundword(X, Sem, Target), !,
        (   Target = no, Negstatus = X, !
        ;   Sem = neg, Negstatus = X, !
        ;   Sem \= neg, Target \= no, !
        )
    ;   true, !                    % word is undefined
    ),
    append(Start, [X], Beg),       % part before segmentation
    segment3(Rest, Newrest, Beg, Negstatus, Seg), !.

% for finding type classes - parse as a sentence
whattoparse(Sem, P, Sent) :-
    member(Sem, [cfinding, pfinding, morph, disease, device, proc, mproc, descriptor]),
    attemptparse(P, Sent, sentence, [Sem], impression, _).

% for bodyloc classes - parse as a bodyloc modifier
whattoparse(Sem, P, Sent) :-
    member(Sem, [bodyloc, region, side, position]),
    attemptparse(P, Sent, bodypart, _, _, _).

% file radrec.pl
% September 7, 1999

% fail an unknown predicate
:- unknown(_, fail).

:- op(900, fy, [\+, not, once]).    % same priority and type as \+

:- op(700, xfx, [\=, ~=]).          % same priority and type as = or ==

:- dynamic(domain/1).               % domain being processed

:- dynamic(outputform/1).           % form of output (needed to distinguish
                                    % markup of text from formatting forms)

:- dynamic(currentsect/1).          % section for outputting results

test_genome(Outfile, Errfile, Unfile) :-
    get_inputsents([], Toklist), !,        % read in and tokenize input
    (   Toklist = [], !,                   % error condition
        app_errl(_, Outfile, 'No input sent'), !
    ;   parse_sentences(Toklist, Fmtlist, Failed, Undef, UnSent, impression,
                        bp, genome, 0), !,
        outputresults(Fmtlist, Failed, Errfile, Undef, Unfile, UnSent, Outfile,
                      full, line, genome, 1, 0, _, exe, plain)
    ).

outputresults(Fmtlist0, Failed, Errfile, Undef, Unfile, UnSent, Outfile,
              Amount, Type, Exam, Compno, DocComp, NewCompno, Caller, Protocol) :-
    tell(Outfile),
    (   Protocol = sgml, !, Op = sgml
    ;   Caller = server, !, Op = sgml
    ;   Op = plain
    ),
    (   Type = nested, !,      % original output form - nested findings
        write('<nested>'), new_line(Op),
        write(Fmtlist), new_line(Op), write('</nested>'),
        new_line(Op), !
    ),
    (   Caller = server,
        write_message(Unfile, Undef, Caller, '<undefined>', '</undefined>')
    ;   Caller = exe, Undef \= [],
        write_message(Unfile, Undef, Caller, '***** Undefined Words *****', [])
        %write_highlight([], UnSent, Caller)
    ;   true
    ),
    (   Caller = server,
        write('<noparse>'), !,
        write_highlight(Undef, UnSent, Caller),
        write_highlight([], Failed, Caller), write('</noparse>')
    ;   Caller = exe, Errfile \= [], Failed \= [],
        tell(Errfile),
        write('***** Sentences/Phrases Not Parsed *****'), nl,
        %write_highlight(Undef, UnSent, Caller),
        write_highlight([], Failed, Caller)
    ;   true                   % no Errfile to write to
    ).



% set_args : Process options
% Argument options
% -p ProbFile (otherwise default is problem messages are not written to file)
% -i Infile (if input is supplied by file and not standard input)
% -m Mode (default is bp; the 6 choices are bp, mode1 - mode5)
% -o Outfile (if output should be file and not standard output)
% -? Provide list of default arguments
% -pr Protocol - sgml or plain (default is plain)
% -u Undefs (otherwise default is - undefined messages are not written
%    to a file)

set_args(Args, Mode, Infile, Outfile, Prbfile, Undef, Protocol) :-
    set_mode(Args, Mode), set_amount(Args, Amount),
    set_protocol(Args, Protocol),
    set_infile(Args, Infile), set_outfile(Args, Outfile),
    set_prbfile(Args, Prbfile), set_undefs(Args, Undef).

set_mode(Args, Mode) :-
    ( nextto('-m', M, Args) ; nextto(m, M, Args) ), !,
    modeis(M, Mode), !.
set_mode(_, bp).                 % default output type

modeis(relax, mode2) :- !.
modeis(strict, mode1) :- !.
modeis(skip, mode4) :- !.
modeis(longest, mode3) :- !.
modeis(best, bp) :- !.
modeis(mode1, mode1) :- !.
modeis(mode2, mode2) :- !.
modeis(mode3, mode3) :- !.
modeis(mode4, mode4) :- !.
modeis(mode5, mode5) :- !.

set_protocol(Args, Protocol) :-
    ( nextto('-pr', Protocol, Args) ; nextto('pr', Protocol, Args) ),
    member(Protocol, [sgml, plain]), !.
set_protocol(_, plain).
set_undefs(Args, Undefs) :-
    nextto('-u', Undefs, Args) ; nextto(u, Undefs, Args), !.   % undef file option
set_undefs(_, []).               % default is no file of undefineds created

set_infile(Args, Infile) :-
    nonvar(Infile), ! ;                  % Infile is set already
    nextto('-i', Infile, Args), ! ;
    nextto(i, Infile, Args), !.
set_infile(_, user_input).               % default is standard input

set_prbfile(Args, Prbfile) :-
    nextto('-p', Prbfile, Args), ! ; nextto(p, Prbfile, Args), !.   % prob file option
set_prbfile(_, []).                      % default is no file of problems is created

set_outfile(Args, Outfile) :-
    nonvar(Outfile), ! ;                 % Outfile is already set
    nextto('-o', Outfile, Args), ! ; nextto(o, Outfile, Args), !.   % outfile option
set_outfile(_, user_output).             % default is standard output
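
% Illustrative sketch (not part of the original listing): how the nextto/3
% lookups above pick an option value out of the argument list. It assumes
% nextto/3 from the lists library, where nextto(X, Y, List) holds when Y
% directly follows X in List. demo_option/3 and its 'none' default are
% hypothetical and only show the flag-then-default pattern used by set_mode,
% set_infile and the other set_* predicates.
demo_option(Flag, Args, Value) :-
    nextto(Flag, Value, Args), !.   % value is the element right after the flag
demo_option(_, _, none).            % assumed default when the flag is absent

% ?- demo_option('-m', ['-i', 'in.txt', '-m', relax], M).
% M = relax.
% ?- demo_option('-o', ['-i', 'in.txt', '-m', relax], O).
% O = none.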

new_line(sgml) :- write('<br>'), nl, !.
new_line(server) :- write('<br>'), nl, !.
new_line(exe) :- nl.
new_line(plain) :- nl.
write_message(_, [], exe, _, _) :- !.
write_message([], _, exe, _, _) :- !.
write_message(_, [], plain, _, _) :- !.
write_message([], _, plain, _, _) :- !.

write_message(File, Contents, Caller, Begmsg, Endmsg) :-
    (   member(Caller, [exe, plain]), tell(File), !
    ;   true
    ),
    write(Begmsg), new_line(Caller),
    (   Contents = []
    ;   write_list(Contents, 1), new_line(Caller)
    ),
    (   Endmsg = [], !
    ;   write(Endmsg), !, new_line(Caller)
    ).

sentend([X|_], Caller) :-
    member(X, ['.', ';', '?']), new_line(Caller), !.



gettargets([], []) :- !.
gettargets([ignore|Rest], [ignore|Rest]) :- !.    % possibly ignore
gettargets([W1|Rest], [T1|Trest]) :-
    foundword(W1, _, T1),          % target for W1
    gettargets(Rest, Trest), !.
gettargets(W, W).                  % not in lexicon
isneg(X) :-
    intersect(X, [no, negative, deny, 'rule out']).

writeoutsent([Word|Rest]) :-
    write(''''), write(Word), write(''''), !,
    (Word = '''', write(''''), ! ; true),
    (Rest \= [], write(','), !, writeoutsent(Rest), ! ;
     true), !.



% This file contains predicates associated with SGML tags
% nextTag(+L, Tag, -PreTag, -PostTag) is true if
% L is the starting List
% Tag is an SGML tag; it could be a variable or instantiated already
% PreTag is portion of L preceding Tag
% PostTag is portion of L following Tag

nextTag(L, Tag, PreTag, PostTag) :-
    append(PreTag, ['<', Tag, '>'|PostTag], L).

% endTag(+L, +Tag, -Pre, -Post) is true if
% L is the starting list
% Tag is the SGML end tag
% Pre is the portion of L preceding the end of tag
% Post is the portion of L following the end of tag

endTag(L, Tag, Pre, Post) :-
    append([Pre, ['<', '/', Tag, '>'], Post], L).
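
% Illustrative sketch (not part of the original listing): splitting a token
% list around an SGML tag using only the standard append/3. The demo_ names
% are hypothetical; they mirror nextTag/4 and endTag/4 above but avoid the
% list-of-lists form of append so the example is self-contained.
demo_next_tag(L, Tag, Pre, Post) :-
    append(Pre, ['<', Tag, '>'|Post], L).        % Pre ++ < Tag > ++ Post = L
demo_end_tag(L, Tag, Pre, Post) :-
    append(Pre, ['<', '/', Tag, '>'|Post], L).   % Pre ++ < / Tag > ++ Post = L

% ?- demo_next_tag([a, '<', sect, '>', b, '<', '/', sect, '>', c], Tag, Pre, Post).
% Tag = sect, Pre = [a], Post = [b, '<', '/', sect, '>', c].
% ?- demo_end_tag([b, '<', '/', sect, '>', c], sect, Pre, Post).
% Pre = [b], Post = [c].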

% enclosedPart(+L, +Tag, -Enclosed, -Post) is true if
% L is the starting List; it is assumed that L is portion of some
%   list that follows a begin tag - i.e. '<',Tag|L
% Tag is the SGML tag
% Enclosed is the portion of text enclosed in tag; not including
%   end tag.
% Post is the portion of L following the end tag

enclosedPart(L, Tag, Enclosed, Post) :-
    endTag(L, Tag, Enclosed, Post).

% file useful.pl - lexical lookup and utility tools
:- unknown(_, fail).
:- dynamic(sentence/1).

:- op(900, fy, [not, once]).     % same priority and type as \+

:- op(700, xfx, [\=, ~=]).       % same priority and type as = or ==
% useful.pl February 21, 1992

% 

% preprocess(+S, +Bs1, -U, -Sem3, +Mode): preprocesses sentence to
%   bracket lexical phrases and remove words/phrases in
%   special db of noise words (nosem in nsphrase.pl db)
% S is original sentence
% Bs1 is preprocessed sentence
% U is list of undefined words in sentence
% Mode is mode of process - in skip mode undefined words are removed
%   from preprocessed sentence

preprocess(S0, Bs1, U, Sem3, Mode) :-      %cfnew
    checkbeg(S0, S),                       % if beginning is 'A)' ignore
    checkphrase(S, S1, Sem1),              % bracket all phrases in phrasal lexicon first
    checklist(S1, U1, Bs, Sem2, Mode),     % check that all words are in lexicon, remove non semantic
    checklist(Bs, U, Bs1, Sem3, Mode).     % check for phrases after non-sem are removed
    %append(Sem1, Sem2, Semi),
    %append(Semi, Sem3, Semlist),
    %union(U1, U2, U).

% found checks if word X is defined as a single word, or if X starts a defined
% phrase

foundword(X) :-
    wdef(X, _, _), !.
foundword(X) :-
    semw(X, _, _, _), !.
%definition from tagged input
foundword(X) :-
    phr(X, _, _, _), !.
foundword([X|Rest]) :-
    Rest \= [],
    phrasal(X, _, [X|Rest], _), !.
% 3/99 added foundword to search the new semact.pl lexicon
% phrasal using semp was added to util.pl
% found/2 returns semantic cat. of word
foundword(X, Sem) :-
    wdef(X, Sem, _).
foundword(X, Sem) :-
    semw(X, Sem, _, _).
%definition from tagged input
foundword(X, Sem) :-
    phr(X, Sem, [], _).
foundword([X|Rest], Sem) :-
    phrasal(X, Sem, [X|Rest], _).
% found/3 returns semantic cat. and target form
foundword(X, Sem, Form) :-
    wdef(X, Sem, Form).
foundword(X, Sem, Form) :-
    semw(X, Sem, Form, _).
%definition from tagged input
foundword(X, Sem, Form) :-
    phr(X, Sem, [], Form).
foundword([X|Rest], Sem, Form) :-
    phrasal(X, Sem, [X|Rest], Form).

% collectsem(+Word, -Sem): Sem is the list of semantic classes corresponding
%   to Word

collectsem(Word, Sem) :-
    setof(X, foundword(Word, X), Sem).
% missing checks if a word present in a sentence is defined
missing(X) :-
    member(X, S),
    not foundword(X).

% checkbeg(+S0, -S) checks beginning of sentence; if it begins with a letter or
% number followed by a ')', that part is skipped
checkbeg([X, ')'|Rest], Rest) :- !.
checkbeg(X, X).

% checks every word in a list to see if it is defined; creates
% a new list of words not defined, and a new list of sentence
% where phrases are bracketed.

checklist([], [], [], [], _).

% if X is a list it has already been identified as a phrase in phrasal lex
checklist([X|Rest], Undef, Newrest, Semlist, Mode) :-
    is_list(X),
    check_no_sem([X|Rest], Rest1, _),
    checklist(Rest1, Undef, Newrest, Semlist, Mode), !.   %is phrase part of nosem
checklist([X|Rest], Undef, [X|Newrest], Semlist, Mode) :-
    %collectsem(X, Sem),
    is_list(X), X = [W1|Tail],
    phrasal(W1, Sem, X, _),
    checklist(Rest, Undef, Newrest, Sem2, Mode), !,
    append([Sem], Sem2, Semlist).
checklist([without|Rest], Undef, Newrest, Semlist, Mode) :-
    checklist([with, no|Rest], Undef, Newrest, Semlist, Mode).
% this problem has to be fixed in preprocessor
% check for a number with a ',' - "11,200" and fix it
%checklist([X, ',', Y|Rest], Undef, [N|Newrest], [number|Semlist], Mode) :-
%    number(X), number(Y), N is X * 1000 + Y, !,
%    checklist(Rest, Undef, Newrest, Semlist, Mode), !.
% check for a literal number   %cfnew

checklist([X|Rest], Undef, [X|Newrest], [number|Semlist], Mode) :-
    number(X),
    checklist(Rest, Undef, Newrest, Semlist, Mode), !.
% beginning of List is a prefix of a phrase that is a complete finding
checklist(List, Undef, [Phrase|Newrest], [cfinding|Semlist], Mode) :-
    check_sem_finding(List, Rest, Phrase),
    checklist(Rest, Undef, Newrest, Semlist, Mode), !.
% beginning of List is a prefix of a phrase that is in nosemantic lexicon
checklist(List, Undef, Newrest, Semlist, Mode) :-
    check_no_sem(List, Rest, Phrase),
    checklist(Rest, Undef, Newrest, Semlist, Mode), !.
% beginning of List is a prefix of a phrase that is in phrasal lexicon
checklist(List, Undef, [Phrase|Newrest], Semlist, Mode) :-
    get_longest_sem(List, Rest, Phrase, Sem),
    %check_sem(List, Rest, Phrase, Sem),    %change to get longest phrase
    checklist(Rest, Undef, Newrest, Sem2, Mode), !,
    append(Sem, Sem2, Semlist).
% beginning of List is a single word that is in semantic lexicon
checklist([X|Rest], Undef, [X|Newrest], Semlist, Mode) :-
    collectsem(X, Sem), !,
    %foundword(X, Sem), !,
    checklist(Rest, Undef, Newrest, Sem2, Mode), !,
    append(Sem, Sem2, Semlist).
% beginning of List is an undefined word
checklist([X|Rest], Undefs, Nrest, Semlist, Mode) :-
    checklist(Rest, Undef, Newrest, Semlist, Mode),
    (member(X, Undef), ! ; Undefs = [X|Undef], !),
    (   Mode = skip, !, Nrest = Newrest
    ;   Mode = bpskip, !, Nrest = Newrest
    ;   Nrest = [X|Newrest]
    ), !.

% if beginning is a number followed by a . followed by a non number
% skip;   %cfnew

checkphrase([X, '.'], [X, '.'], []) :- !.
checkphrase([X, '.', Z|Rest], Y, Semlist) :-
    number(X), not(number(Z)), checkphrase(Rest, Y, Semlist), !.
% beginning of List is a prefix of a phrase that is a complete finding
% or a phrase in phrasal lexicon
checkphrase(List, [Phrase|Newrest], Semlist) :-
    (   check_sem_finding(List, Rest, Phrase), Sem = [cfinding]
    ;   get_longest_sem(List, Rest, Phrase, Sem)
    ), !,
    %check_sem(List, Rest, Phrase, Sem)), !,
    checkphrase(Rest, Newrest, Sem2), !,
    append(Sem, Sem2, Semlist).
checkphrase([W|Rest], [W|Newrest], Semlist) :-
    checkphrase(Rest, Newrest, Semlist).
checkphrase([], [], []).

check_sem_finding([W|Tail], Tail, W) :-
    W = [W1|Rest],                       % W is bracketed already
    sem_finding_sent(W1, W, _).
check_sem_finding([W|Tail], Sfinal, Phrase) :-
    sem_finding_sent(W, Phrase, _),
    begsublist(Phrase, [W|Tail], Sfinal), !.
sem_finding_sent(_, _, _) :- fail.

% check_no_sem(+Sent, -Rest, -Phrase): removes Phrase from Sent resulting
% in Rest if Sent begins with a phrase in nosem (non-semantic list).
check_no_sem([W|Tail], Sfinal, Phrase) :-
    nosem(W, Phrase),                    %phrase beg. with W that should be removed
    begsublist(Phrase, [W|Tail], S1),
    remove_comma(S1, Sfinal), !.         % remove if it is next

% get_longest_sem(+Sent, -Rest, -Phrase, -Sem): Phrase is longest phrase that is
% a prefix of Sent; Rest is remainder and Sem is list of semantic classes
get_longest_sem(Sent, Rest, Phrase, [Sem]) :-
    setof(X, check_sem(Sent, X), L),     % set of Phrases
    maxphrase(L, [], Phrase, 0),         % Phrase with maximum length
    append(Phrase, Rest, Sent),          % rest of sentence after Phrase
    foundword(Phrase, Sem).

% check_sem(+Sent, -Rest, -Phrase, -Sem): checks if phrase beginning with
% Sent is in phrasal lexicon; Rest is the remainder of Sent after phrase
% Sem is the semantic class

check_sem([W|Tail], Rest, Phrase, Sem) :-
    phrasal(W, Sem, Phrase, _),
    begsublist(Phrase, [W|Tail], Rest).
% this also obtains the Target form
check_sem([W|Tail], Rest, Phrase, Sem, Target) :-
    phrasal(W, Sem, Phrase, Target),
    begsublist(Phrase, [W|Tail], Rest).
check_sem([W|Tail], Tail, W, Sem) :-
    is_list(W),          %enclosed in brackets means it is a phrase
    W = [W1|Rest],
    phrasal(W1, Sem, W, _), !.
check_sem([W|Tail], Tail, W, Sem, Target) :-
    is_list(W),          %enclosed in brackets means it is a phrase
    W = [W1|Rest],
    phrasal(W1, Sem, W, Target), !.
% check_sem(+Sentence, -Phrase) is similar to check_sem/4 except for fewer args
check_sem(Sentence, Phrase) :-
    check_sem(Sentence, _, Phrase, _).



% file util.pl

%%%%%%%%%%%%%%%% Utility Predicates

% fail an unknown predicate
:- unknown(_, fail).
:- op(900, fy, [not, once]).     % same priority and type as \+
:- op(700, xfx, [\=, ~=]).       % same priority and type as = or ==

:- dynamic(wfst/6).
:- dynamic(addstotal/1).
:- dynamic(paragno/1).
:- dynamic(sectno/1).
:- dynamic(phr/4).

% wfst(+Rule, +Number, +Res, +Fmt, +S0, +S): well-formed symbol table
% Rule is the name of rule; Number is the option number
% Res is s for success and f for failure
% Fmt is the format (for successes); for failure Fmt is []
% S0 is the sentence position at the start of Rule
% S is the sentence position when Rule has been completed
% add to wfst


addst(Rule, Number, Res, Fmt, S0, S) :-
    \+ checkst(Rule, Number, Res, Fmt, S0, S),    % result for rule was saved already
    \+ checkst(Rule, Number, i, Fmt, S0, S),      % result from different rule saved
    (   checkst(Rule, _, Res, Fmt, S0, S),        % different rule produced same result
        assert(wfst(Rule, Number, i, Fmt, S0, S))
    ;   assert(wfst(Rule, Number, Res, Fmt, S0, S))
    ), !.
addst(_, _, _, _, _, _) :- !.    % always succeed

% checkst(+Rule, -Number, -Res, -Fmt, +S0, -S): checks to see if rule has been saved
% in wfst

checkst(Rule, Number, Res, Fmt, S0, S) :-
    wfst(Rule, Number, Res, Fmt, S0, S).

% beglist(L,Y) - is Y the head of list L
beglist([X|_], Y) :- X = Y, !.

% splice(+L1, -L2): L1 is a list of lists; L2 is merged list
splice(L1, L2) :- append(L1, L2), !.
%splice([], []) :- !.
%splice([[]], []) :- !.
%splice([X], X) :- !.
%splice([[]|L1], L2) :- splice(L1, L2), !.
%splice([[[]]|L1], L2) :- splice(L1, L2), !.
%splice([X|[[]]], L) :- splice(X, L), !.
%splice([L1, L2], L3) :-
%    append(L1, L2, L3), !.
%splice([X|L1], L2) :-
%    splice(L1, L3),
%    append(X, L3, L2), !.

% splicerel - works with relations which have Arg1, ..., Argn.
% It splices a Splicelist in each arg of relation

splicerel(Finding, Splicelist, Spliced) :-
    splice(Splicelist, Spl),
    (   Finding = [rel, X|Rest], spliceargs(Rest, Spl, Sp),
        %splice([[rel, X], Sp], Spliced), ! ;
        append([rel, X], Sp, Spliced), !
    ;   %splice([Finding, Spl], Spliced)).
        append(Finding, Spl, Spliced)
    ).
% spliceargs - Splices a list into each element of a list
spliceargs([], _, []) :- !.

spliceargs([Arg1|Rest], Splicelist, Spliced) :-
    %splice([Arg1, Splicelist], Sarg1),
    append(Arg1, Splicelist, Sarg1),
    spliceargs(Rest, Splicelist, Srest),
    %splice([[Sarg1], Srest], Spliced).
    append([Sarg1], Srest, Spliced).
list([], []).
list([X|[]], X).

list([X|L1], L2) :- list(L1, L3),
    append([X], L3, L2), !.

% strip(L1, L2) removes extra square brackets from L
strip([L], L).

% B is a suffix of A and C is the difference
difflist(A, B, C) :- append(C, B, A).

% S is a sublist at beg. of L if there is a list Rest, which when appended
% to S results in L.

begsublist(S, L, Rest) :- append(S, Rest, L), !.
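
% Illustrative queries (not part of the original listing), assuming the
% standard append/3. They show the two directions in which the list helpers
% above are used: stripping a known prefix and stripping a known suffix.
% The word lists are made-up examples.
%
% ?- begsublist([chest, pain], [chest, pain, is, absent], Rest).
% Rest = [is, absent].
%
% ?- difflist([a, b, c, d], [c, d], C).
% C = [a, b].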

% checks that first element in list S has semantic category in Semlist
firstword([W1|_], Semlist) :-
    atom(W1), wdef(W1, Sem, _),          % semantic category
    member(Sem, Semlist).
firstword([W1|_], Semlist) :-
    is_list(W1), phrasal(W1, Sem, _, _),
    member(Sem, Semlist).
% removes phrases from first arg that are in nsphrase - lexicon of non-sem. phrases

remove_no_sem([], []) :- !.
remove_no_sem([W|Tail], Sfinal) :-
    nosem(W, Phrase),                    %phrase beg. with W
    begsublist(Phrase, [W|Tail], S1),    %remove from sentence
    remove_comma(S1, S2),                %remove "," if it is next
    remove_no_sem(S2, Sfinal), !.
remove_no_sem([W|Tail], Sfinal) :-
    remove_no_sem(Tail, S1),
    append([W], S1, Sfinal), !.
remove_comma([','|Tail], Tail).
remove_comma(S, S).

% remove_sem(+Sent, -NewSent): Sent is the original sentence, NewSent is
% stripped of all phrases that are defined in lexicon
remove_sem([], []) :- !.
remove_sem(S, NewS) :-
    check_sem(S, Rest, _, _),        % phrase in sent. is in lexicon - remove it
    remove_sem(Rest, NewS), !.
remove_sem(S, NewS) :-
    check_no_sem(S, Rest, _),        % phrase in sent. is in nosem list - remove
    remove_sem(Rest, NewS), !.
remove_sem([X|Tail], [X|NewS]) :-
    remove_sem(Tail, NewS), !.       % not a phrase, process rest
% remove_words(+Sent, -NewSent): Sent is the original sentence, NewSent
% is stripped of all words that are in lexicon
remove_words([], []) :- !.
remove_words([X|Rest], NewRest) :-
    (   ( foundword(X) ; number(X) ),               % X is defined in lexicon
        remove_words(Rest, NewRest), !
    ;   remove_words(Rest, New), NewRest = [X|New], !   % X is not in lexicon
    ).

% maxphrase(+ListofPhrases, +Maxin, -MaxOut, +InitMaxLen) is true if
% ListofPhrases is a list of multi-word phrases,
% Maxin is phrase with maximum words so far
% MaxOut is phrase with maximum length of phrases in ListofPhrases
% InitMaxLen is length of initial phrase which is of max. length

maxphrase([], Maxin, Maxin, _) :- !.     % no more phrases - maximum is same as Maxin
maxphrase([P|Rest], Maxin, Maxout, InitMaxLen) :-
    length(P, Len),                      % length of first phrase
    (   Len > InitMaxLen, !, maxphrase(Rest, P, Maxout, Len)
    ;   Len < InitMaxLen, !, maxphrase(Rest, Maxin, Maxout, InitMaxLen)
    ).
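
% Illustrative sketch (not part of the original listing): selecting the
% longest candidate phrase from a list, the same job maxphrase/4 does for
% get_longest_sem. demo_longest/3 is a hypothetical name. One assumption
% differs from the listing as printed: when two phrases have equal length,
% this version keeps the earlier one instead of failing.
demo_longest([], Best, Best).
demo_longest([P|Ps], Best0, Best) :-
    length(P, Len),
    length(Best0, Len0),
    (   Len > Len0
    ->  demo_longest(Ps, P, Best)        % P is strictly longer: new best
    ;   demo_longest(Ps, Best0, Best)    % otherwise keep the current best
    ).

% ?- demo_longest([[chest], [chest, pain], [chest, x, ray]], [], P).
% P = [chest, x, ray].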

%%%%%%%%%%%%%%%%%%%%%%%%%%% lexical interface predicates %%%%%%%%%%%%%%%%%%%%%%
%acclex(Sem, W, S0, S) :-
%    outputform(htext), !, acclex1(Sem, W, S0, S).
acclex(Sem, W, S0, S) :-
    acclex2(Sem, W, S0, S).
acclex(Sem, W, S0, S) :-
    acclexss(Sem, Syn, Target, Features, S0, S).
% check lexicon for word or phrase, Target form is original W
acclex1(p, [P, C], [W|Rest], Rest) :-
    is_list(W),
    find_sem_phrase(p, [P, C], W).
acclex1(p, [P, C], [W|S], S) :- atom(W),
    wdef(W, p, [P, C]).
acclex1(Sem, [W], [W|Rest], Rest) :-
    is_list(W),      %if bracketed list, get Sem and Code from phrasal lexicon
    find_sem_phrase(Sem, _, W).

acclex1(Sem, W, [W|S], S) :- atom(W),
    wdef(W, Sem, _).

% check lexicon for word or phrase, Target form is taken from lexicon
%acclex2(Sem, Code, [W|Rest], Rest) :-
%    is_list(W),     %if bracketed list, get Sem and Code from phrasal lexicon
%    find_sem_phrase(Sem, Code, W).

acclex2(Sem, Code, [W|S], S) :- foundword(W, Sem, Code),
    nonvar(Code).    % protect against lex. error

% find a phrase [W|Tail] in lexicon that begins with W and has category Sem
find_sem_phrase(Sem, Code, [W|Tail]) :-
    phrasal(W, Sem, [W|Tail], Code),     % phrase and code beg. with W
    nonvar(Code).

% case where phrase is already bracketed, look up phrase
sem_finding_phrase1(Code, [W|Tail], Tail) :-
    is_list(W),                  %phrase is bracketed
    find_sem_sent(Code, W),
    nonvar(Code).                %protect against lexical error
% case where phrase is already bracketed, look up phrase
sem_finding_phrase2(Code, [W|Tail], Tail) :-
    is_list(W),                  %phrase is bracketed
    find_sem_sent(Code, W),
    nonvar(Code).                %protect against lexical error
% Phrasal succeeds if lexicon contains phrase
phrasal(W1, Sem, Phrase, Code) :-
    phrase(W1, Sem, Phrase, Code, _).    %multi-word phrase in lexicon

% added March 15, 1999
phrasal(W1, Sem, Phrase, Code) :-
    semp(W1, Sem, Phrase, Code, Features).
% lexical definition from marked up input
phrasal(W1, Sem, [W1|Tail], Code) :-
    phr(W1, Sem, Tail, Code).
acclexss(Sem, Syn, Target, Features, [W|S], S) :-
    atom(W),
    semw(W, Sem, Target, Features),
    synw(W, Synclass),
    member(Synclass, Syn).
acclexss(Sem, Syn, Target, Features, [W|S], S) :-
    is_list(W),
    find_phrasess(W, Sem, Syn, Target, Features).
find_phrasess([W1|Tail], Sem, Syn, Target, Features) :-
    semp(W1, Sem, [W1|Tail], Target, Features),
    synp(W1, [W1|Tail], Synclass),
    member(Synclass, Syn).

% lexical definition of a complete finding
find_sem_sent(Code, [W|Tail]) :-
    sem_finding_sent(W, [W|Tail], Code).

listify(C, [C]) :-
    atom(C), !.
listify(C, C) :-
    is_list(C), !.

% distributes left mods and right mods over list of findings creating
% list of lists of findings with mods
distributemods([], [], _, _, _) :- !.

distributemods(Dist, [D1|Tail], Lmods, Rmods, Type) :-
    distributemods(Dist2, Tail, Lmods, Rmods, Type),   %distributed for remainder
    mergemods(Lmods, Rmods, Allmods),
    frame(D, Type, D1, Allmods),         %Type frame with mods
    append([D], Dist2, Dist).            % Combine findings to get list of findings

% fixconj - if Leftmods has [certainty, no], and Conj = or, change Conj to and.
% no A or B = no A and no B; 'denies A, B, or C' is similar.

fixconj(Leftmods, Conj, [rel, and]) :-
    ( member([certainty, no], Leftmods) ; member([certainty, deny], Leftmods) ),
    Conj = [rel, or].
fixconj(_, Conj, Conj).
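
% Illustrative sketch (not part of the original listing): the rewrite that
% fixconj performs, "no A or B" = "no A and no B", as a deterministic toy.
% demo_fixconj/3 is a hypothetical name and assumes memberchk/2 from the
% lists library. It commits to the first matching clause, whereas fixconj/3
% above can also return the unchanged conjunction on backtracking.
demo_fixconj(Leftmods, [rel, or], [rel, and]) :-
    (   memberchk([certainty, no], Leftmods)
    ;   memberchk([certainty, deny], Leftmods)
    ), !.                                % negated context: or becomes and
demo_fixconj(_, Conj, Conj).             % otherwise leave Conj unchanged

% ?- demo_fixconj([[certainty, no]], [rel, or], C).
% C = [rel, and].
% ?- demo_fixconj([[status, new]], [rel, or], C).
% C = [rel, or].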

% write_sentences/1 inputs a PROLOG list and prints out lines
% which are English sentences. No wrapping is done.

write_sentences([]) :- !.

write_sentences([X]) :- write(X), nl.    % special sentence - section name
write_sentences(['<', p, '/', '>']) :-
    write('<p/>'), nl.                   % paragraph mark
write_sentences([X|Rest]) :-
    upper_first([X|Rest], [U|Rest]),
    write(U),            % First letter of first word made upper case
    %write(X),
    (   X = U, chkforpunct(U, Rest), !, write_terms(Rest)   % no space needed
    ;   write(' '), write_terms(Rest)
    ).

% write_sentence/2 inputs a PROLOG list and prints out an English
% sentence wrapped. Idlen is the starting position of the sentence
% in the output.
% uses libraries ctypes, basic, not
write_sentence([X|Rest], Idlen) :-
    upper_first([X|Rest], [U|Rest]),
    write(U),
    name(U, LU), length(LU, L),
    (U = X, chkforpunct(U, Rest), !, write_terms(Rest, L+Idlen) ;
     write(' '), write_terms(Rest, L+Idlen+1)
    ).

% write_list inputs a PROLOG list and prints out a sentence-like list,
% wrapped. Idlen is the starting position of the list in the output.
write_list([X|Rest], Idlen) :-
    write(X),
    name(X, LU), length(LU, L),
    (chkforpunct(X, Rest), write_terms(Rest, L+Idlen), ! ;
     write(' '), write_terms(Rest, L+Idlen+1)).

%write_list(+List, +Idlen, -Idlenout)
% write_list prints out a sentence-like list with wrapping if necessary.
% List is the list to be printed
% Idlen is the column position at start
% Idlenout is the column position at end
write_list([], Len, Len) :- !.
write_list([X|Rest], Idlen, Idlenout) :-
    atomic(X), write(X),
    name(X, LU), length(LU, L),
    (L + Idlen > 74, nl, Idlen2 = 1, ! ;
     Idlen2 = L + Idlen, !
    ),
    (chkforpunct(X, Rest), write_list(Rest, Idlen2, Idlenout) ;
     write(' '), write_list(Rest, L+Idlen2+1, Idlenout)
    ) ;
    is_list(X), write_list(X, Idlen, Idlen2), write_list(Rest, Idlen2, Idlenout).

upper_first([X|Rest], [U|Rest]) :-
    name(X, [L|Z]),
    (is_alpha(L), Up is L - 32, ! ; Up = L),
    name(U, [Up|Z]), !.
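
% Sketch of a stricter variant (an assumption, not the original code):
% upper_first/2 above subtracts 32 from any code accepted by is_alpha/1.
% Assuming is_alpha/1 accepts upper-case as well as lower-case letters,
% a word that already starts with a capital would be altered too.
% upper_first_demo/2 is a hypothetical version that shifts only the
% codes for a..z and leaves every other first character unchanged.
upper_first_demo([X|Rest], [U|Rest]) :-
    name(X, [L|Z]),
    (   L >= 0'a, L =< 0'z
    ->  Up is L - 32          % lower-case ASCII letter: map to upper case
    ;   Up = L                % anything else is kept as it is
    ),
    name(U, [Up|Z]).
% ?- upper_first_demo([heart, is, normal], S).
%    S = ['Heart', is, normal].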

% write_terms/1 writes out a word followed by blank, except for punctuations.
write_terms([]) :- !.
% case where X is end of sentence
write_terms([X|Rest]) :-
    (X = '.' ; X = ';'), % last word of sentence
    write(X), nl, !, write_sentences(Rest), !.
% case where X is interior of sentence
write_terms([X|Rest]) :-
    write(X),
    (chkforpunct(X, Rest), write_terms(Rest) ;
     write(' '), write_terms(Rest)
    ).

% write_terms(List, Used): writes the terms in list and counts the number
% of columns used; starts new line if 75 columns have been used
write_terms([], _) :- !.
% at end of list
write_terms(['.'], _) :- write('.'), nl, !.
write_terms([';'], _) :- write(';'), nl, !.
% X is a punctuation, don't add to final count
write_terms([X|R], Used) :-
    (R = [], write(' '), write(X), ! ;
     chkforpunct(X, R),
     write(X), write_terms(R, Used), !
    ).

% X is last word in sentence
write_terms([X, '.'], Used) :-
    name(X, List), length(List, Len),
    Need is Len + 2,
    Total is Used + Need,
    (Total =< 75, write(' '), write(X), write('.') ;
     Total > 75, nl, write(' '), write(X), write('.')),
    nl, !.

% X is last word in sentence
write_terms([X, ';'], Used) :-
    name(X, List), length(List, Len),
    Need is Len + 2,
    Total is Used + Need,
    (Total =< 75, write(' '), write(X), write(';') ;
     Total > 75, nl, write(' '), write(X), write(';')),
    nl, !.
% X is followed by ','
write_terms([X, ','|Rest], Used) :-
    name(X, List), length(List, Len),
    Need is Len + 2,
    Total is Used + Need,
    (Total =< 75, write(' '), write(X), write(','),
     write_terms(Rest, Total) ;
     Total > 75, nl, write(' '), write(X), write(','),
     New is Need - 1, write_terms(Rest, New)),
    !.

% writes blank + name of X, used is length of name+1
write_terms([X|Rest], Used) :-
    name(X, List), length(List, Len),
    Need is Len + 1,
    Total is Used + Need,
    (Total =< 75, write(' '), write(X), write_terms(Rest, Total) ;
     Total > 75, nl, write(' '), write(X), write_terms(Rest, Len)), !.
write_terms([X, '''', s|Rest], Used) :-
    name(X, List), length(List, Len),
    Need is Len + 3,
    Total is Used + Need,
    (Total =< 75, write(' '), write(X), write('''s'),
     write_terms(Rest, Total) ;
     Total > 75, nl, write(X), write_terms(Rest, Len)), !.
% processes sentences in Infile; writes formats to Outfile
% sentences beginning with '%' are treated as comments
testsents(Infile, Outfile) :-
    see(Infile), seen, see(Infile),
    tell(Outfile),
    readtests,
    see(Infile), seen, told.
% reads next sentence and processes it
readtests :-
    read_in(X),
    (X = end_of_file, ! ;
     X = [eoff, '.'], ! ;
     X = [' '], ! ;
     X = ['%'|_], !, readtests ; % don't process comments
     preprocess(X, Bs, Undef, Semlist, skip),
     (Undef = [],
      dosent(X, Bs, Semlist, Fmt, Message, impression, W, chestxray, strict, 0),
      write_sentence(X, 1), write(Bs), nl,
      write(Fmt), nl ;
      Undef \= [], write_sentence(X, 1), write(Bs), nl, write(Undef), nl),
     readtests % read next sentence
    ).

% Reads in all sentences from input file and creates one list of all sentences
get_inputsents(Prevlist, Toklist) :-
    read_in(X),
    (X = end_of_file, Toklist = Prevlist, ! ;
     X = [eoff, '.'], Toklist = Prevlist, ! ;
     X = [' '], Toklist = Prevlist, ! ;
     (last(' ', X), append(Toklist, [' '], X), ! ; % remove
      append(Prevlist, X, Newlist),
      get_inputsents(Newlist, Toklist)
     )).

%get_sentence(+A, -B, -C)
% Gets next sentence from input list containing all sentences read in
% Don't end a sentence if "." is preceded by a number and followed by
% a number and unit measure - 1.25 cm, 1.5 cm, .5 cm
% or is followed by a "." which is part of abbreviation
% get_sentence(A,B,C) - A is list of all sentences in report
%                     - B is list containing one sentence
%                     - C is remainder excluding B

% sgml tag for multi-word phrase containing '.' that is not end of sentence
get_sentence(['<', phr|Tail], Sentence, LRest) :-
    enclosedPart(Tail, phr, Between, Rem), % Between beg. part of open phr and close tag of phr
    append([sem, =, '"', Sem, '"'], MoreAttributes, Between), % Sem is value of sem attribute
    (MoreAttributes = ['>'|Phrase], TargetList = Phrase, ! ;
     MoreAttributes = [t, =, '"'|TargetPlus], % Target terms plus end of phr
     append(TargetList, ['"', '>'|Phrase], TargetPlus), ! % t attribute followed by actual phrase
    ),
    Phrase = [W1|Rest],
    append(Phrase, SRest, Sentence),
    concat_atom(TargetList, Target),
    assert(phr(W1, Sem, Rest, Target)), % assert lex def according to input
    %Phrase = [W1|PRest],
    %abbrev(W1, [W1|PRest], Target, _),
    get_sentence(Rem, SRest, LRest), !.



% Ignore sentence starting with '%', get next sentence
get_sentence(['%', '%'|Rest], Sent, Remainder) :-
    get_sentence(Rest, _, Rem),
    get_sentence(Rem, Sent, Remainder).
get_sentence([X, '.', Y, Z|Rest], [X, '.'], [Y, Z|Rest]) :- % break up "140. 3+"
    number(X), number(Y), Z = '+', !. % Y belongs to '+' for new sentence
get_sentence([X, '.', Y, Z|Rest], [N|SRest], LRest) :- % 1.5 cm
    number(X), number(Y),
    % (wdef(Z, unit, _) ; Z = x),
    Z \= '+', % break up "140. 3+"
    name(X, D1), name('.', D2), name(Y, D3), name('E+00', D4),
    append([D1, D2, D3, D4], D), name(N, D), % put number together
    get_sentence([Z|Rest], SRest, LRest).
% common abbrev
get_sentence([X, '.'|Rest], [X|SRest], LRest) :- % abbrev ending in "."
    % list of common abbreviations seen in reports should not end sentence
    member(X, [vs, dr, cm, mg]), get_sentence(Rest, SRest, LRest), !.
% list of start of names in reports should not end sentence
get_sentence([X, '.'|Rest], [X|SRest], LRest) :- % abbrev ending in "."
    member(X, [ms, mr, mrs, dr, st]),
    skipname(Rest, Rest0), % skip name part
    get_sentence(Rest0, SRest, LRest), !.
% more known abbreviations
get_sentence([W1|Rest], [Rep|SRest], LRest) :-
    abbrevchk([W1|Rest], _, Rem, Rep), % abbreviation
    get_sentence(Rem, SRest, LRest), !.
% possible simple xml tag for new paragraph
get_sentence(['<', p, '/', '>'|Rest], Sent, Rem) :- % skip paragraph marker
    get_sentence(Rest, Sent, Rem), !.
% xml tag for sentence '<s>'
get_sentence(['<', s, '>'|Tail], Sentence, Rest) :-
    enclosedPart(Tail, s, Sent, Rest),
    (last('.', Sent), Sentence = Sent, ! ; % already has '.'
     append(Sent, ['.'], Sentence)
    ), !. % add '.'

get_sentence(['.'|Rest], ['.'], Rest) :- !. % end of a sentence
get_sentence([';'|Rest], [';'], Rest) :- !.
% interior of sentence
get_sentence([X|Rest], [X|SRest], LRest) :-
    get_sentence(Rest, SRest, LRest).
get_sentence([], [], []). % no more sentences
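
% Simplified sketch (hypothetical predicate, not the original code): the
% last four get_sentence/3 clauses form the core splitter, cutting the
% token list after the first '.' or ';'. split_demo/3 repeats just those
% clauses so the behaviour can be tried without the abbreviation, number
% and tag handling defined above.
split_demo(['.'|Rest], ['.'], Rest) :- !.
split_demo([';'|Rest], [';'], Rest) :- !.
split_demo([X|Rest], [X|SRest], LRest) :-
    split_demo(Rest, SRest, LRest).
split_demo([], [], []).
% ?- split_demo([no, edema, '.', lungs, clear, '.'], S, R).
%    S = [no, edema, '.'], R = [lungs, clear, '.'].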

% abbrevchk(+WordList, -AbList, -RemList, -Target) is true if an abbrev is prefix
% of WordList, RemList is suffix of WordList (excluding prefix),
% AbList is prefix consisting of abbreviation
% and Target is target form of abbreviation
abbrevchk([W1|Rest], AbList, RemList, Target) :-
    abbrev(W1, AbList, Target, Dom), % abbrev knowledge base indexed by 1st word
    append(AbList, Rem, [W1|Rest]),  % remainder of abbrev. must be in sentence
    (Dom = general, ! ;                  % abbrev. applies to all domains
     domain(Thisrep), Dom = Thisrep, ! ; % abbrev. applies to this domain
     is_list(Dom), member(Thisrep, Dom)  % this domain in abbrev. list
    ),
    ( % add back '.' to sentence if it also signals end of sentence
     Rem = [], last('.', AbList), RemList = ['.'], ! % no more words
     ; % words that generally start a new sentence
     Rem = [W2|_], last('.', AbList), member(W2, [his, her, he, she, the, this]),
     RemList = ['.'|Rem], !
     ; % don't add '.' back
     RemList = Rem
    ).

% skipname(+Beglist, -Endlist): skips next word after "mr" or "st"
skipname([], []) :- !.
skipname([_, '''', s|Rest], Rest) :- !. % "Luke's"
skipname([o, '''', _|Rest], Rest) :- !. % "O'Grady"
skipname([_|Rest], Rest) :- !.

%get_section(+Toklist, -Sents, -Rest, -Section, -Printname, Addno)
% Toklist contains input list; 1st sentence should be a header;
% Sents are all sentences in section; Section is name of section
% Sentences at beg. of Toklist are ignored until a section header is found
get_section([T|Toklist], Sents, Rest, Section, Printname, Addno) :-
    % first sentence should be section header
    get_sentence([T|Toklist], Sentence, RToklist),
    (section_header(Sentence, Rsent, Section, Printname), % Sentence is a section header
     append(Rsent, RToklist, RToklist2),
     get_sectionsents(RToklist2, Sents, Rest),
     (Addno = 0, ! ; % testing if input begins with section header
      Addno = 1, !, sectno(Sectno), Newno is Sectno + 1,
      retractall(sectno(_)), assert(sectno(Newno))
     ),
     retractall(paragno(_)), assert(paragno(1)), % 1st parag. of section
     retractall(sentno(_)), assert(sentno(0))    % 1st sentence of parag.
     ; % 1st sentence is not a legitimate header - return []
     Section = []
     % get_section(RToklist, Sents, Rest, Section) % skip till find header
    ), !.

get_section([], [], [], [], _, _).
get_sectionsents([], [], []) :- !.
get_sectionsents(Toklist, Slist, Rest) :-
    get_sentence(Toklist, Sentence, RToklist), % one sentence
    (\+ section_header(Sentence, _, _, _), % more sentences in section
     get_sectionsents(RToklist, RSents, Rest),
     append(Sentence, RSents, Slist)
     ; % the next section is a section header - return
     Rest = Toklist, Slist = []).

section_header(S, Rests, 'report clinical information item',
               'CLINICAL INFORMATION:.') :-
    (S = [clinical, information, ':', '.'], !, Rests = [] ;
     begsublist([clinical, information, ':'], S, Rests), ! ;
     S = [clininfo, ':', '.'], Rests = [], ! ;
     begsublist([clininfo, ':'], S, Rests), !
    ).

section_header(S, Rests, 'report impression item',
               'IMPRESSION:.') :-
    (S = [impression, ':', '.'], Rests = [], ! ;
     begsublist([impression, ':'], S, Rests), !
    ).

section_header(S, Rest, 'report summary item', 'SUMMARY:.') :-
    S = [summary, ':'|Rest].

section_header(S, Rests, 'report description item', 'DESCRIPTION:.') :-
    (S = [description, ':', '.'], Rests = [], ! ;
     begsublist([description, ':'], S, Rests), !
    ).

section_header(S, Rest, 'report diagnosis item', 'DISCHARGE DIAGNOSIS:.') :-
    (S = [discharge, diagnosis, ':'|Rest] ;
     S = [final, diagnosis, ':'|Rest] ;
     S = [principle, diagnosis, ':'|Rest] ; S = [associated, diagnosis|Rest] ;
     S = [transfer, diagnosis, ':'|Rest] ;
     S = [diagnosis, '(', es, ')', ':'|Rest] ;
     S = [diagnosis, ':'|Rest]
    ), !.

section_header(S, Rest, 'report laboratory data item', 'LAB DATA:.') :-
    S = [laboratory, data, ':'|Rest], !.
section_header(S, Rest, 'report medications item', 'MEDICATIONS:.') :-
    S = [medications, ':'|Rest], !.
section_header(S, Rest, 'report current medications item', 'MEDICATIONS:.') :-
    S = [current, medications, ':'|Rest], !.
section_header(S, Rest, 'report discharge medications item',
               'DISCHARGE MEDICATIONS:.') :-
    S = [discharge, medications, ':'|Rest], !.
section_header(S, Rest, 'report discharge disposition item',
               'DISCHARGE DISPOSITION:.') :-
    S = [discharge, disposition, ':'|Rest], !.
section_header(S, Rest, 'report medications on admission item',
               'MEDICATIONS:.') :-
    S = [medications, on, admission, ':'|Rest], !.
section_header(S, Rest, 'report medications on transfer iterm',
               'MEDICATIONS:.') :-
    S = [medications, on, transfer, ':'|Rest], !.
section_header(S, Rest, 'report procedure item', 'PROCEDURE:.') :-
    (S = [operation, ':'|Rest] ; S = [procedure, ':'|Rest]
    ), !.

section_header(S, Rest, 'report indications for procedure item', 'INDICATIONS:.') :-
    (S = [indications, for, procedure, ':'|Rest] ;
     S = [indications, for, operation, ':'|Rest]
    ), !.

section_header(S, Rest, 'report preoperative diagnosis item', 'PREOP DIAGNOSIS:.') :-
    S = [preoperative, diagnosis, ':'|Rest], !.
section_header(S, Rest, 'report admitting diagnosis item', 'ADMITTING DIAGNOSIS:.') :-
    S = [admitting, diagnosis, ':'|Rest], !.
section_header(S, Rest, 'report postoperative diagnosis item', 'DIAGNOSIS:.') :-
    S = [postoperative, diagnosis, ':'|Rest], !.
section_header(S, Rest, 'report physical examination item',
               'PHYSICAL EXAM:.') :-
    S = [physical, examination, ':'|Rest], !.
section_header(S, Rest, 'report chief complaint item', 'CHIEF COMPLAINT:.') :-
    S = [chief, complaint, ':'|Rest], !.
section_header(S, Rest, 'report hospital course item', 'HOSPITAL COURSE:.') :-
    S = [hospital, course, ':'|Rest], !.
section_header(S, Rest, 'report allergy item', 'ALLERGIES:.') :-
    S = [allergies, ':'|Rest], !.

section_header(S, Rest, 'report follow up item', 'FOLLOW UP:.') :-
    S = [follow, up, ':'|Rest], !.
section_header(S, Rest, 'report findings item', 'FINDINGS:.') :-
    S = [findings, ':'|Rest], !.
section_header(S, Rest, 'report indications and findings item', 'FINDINGS:.') :-
    S = [indications, and, findings, ':'|Rest], !.
section_header(S, Rest, 'report indications and findings item', 'INDICATIONS:.') :-
    S = [indications, ':'|Rest], !.
section_header(S, Rest, 'report provisional diagnosis item', 'PRELIM DIAGNOSIS:.') :-
    S = [provisional, diagnosis, ':'|Rest], !.
section_header(S, Rest, 'report review of systems item', 'REVIEW OF SYSTEMS:.') :-
    S = [review, of, systems, ':'|Rest], !.
section_header(S, Rest, 'report past history item', 'PAST MEDICAL HISTORY:.') :-
    S = [past, history, section, ':'|Rest], !.
section_header(S, Rest, 'report past history item', 'PAST MEDICAL HISTORY:.') :-
    S = [past, medical, history, ':'|Rest], !.
section_header(S, Rest, 'report social history item', 'SOCIAL HISTORY:.') :-
    S = [social, history, ':'|Rest], !.
section_header(S, Rest, 'report past history item', 'PAST MEDICAL HISTORY:.') :-
    S = [history, ':'|Rest], !.
section_header(S, Rest, 'report past history item', 'PAST MEDICAL HISTORY:.') :-
    S = [brief, history, ':'|Rest], !.
section_header(S, Rest, 'report history of present illness item',
               'HISTORY OF PRESENT ILLNESS:.') :-
    S = [history, of, present, illness, ':'|Rest], !.
section_header(S, Rest, 'report history of present illness item',
               'HISTORY OF PRESENT ILLNESS:.') :-
    S = [history, of, the, present, illness, ':'|Rest], !.
section_header(S, Rest, 'report specimen item', 'SPECIMEN') :-
    S = [specimen|Rest], !.

% sentence consists of id number only or "." only,
isidentifier([X, '.']) :-
    integer(X).
isidentifier([X, ';']) :-
    integer(X).

isidentifier(['.']) :- !. % sentence consists only of .
isidentifier(['.', '<eos>']) :- !.

isidentifier(['<', p, '/', '>']) :- % paragraph marker sentence - update no.
    paragno(N),
    retractall(paragno(_)),
    Newno is N + 1,
    assert(paragno(Newno)),
    retractall(sentno(_)),
    assert(sentno(0)).

% skipsentence is true, if sentence should be ignored.
% Skip sentences containing family info
skipsentence([X|_]) :-
    foundword(X, family), !.
skipsentence([X|_]) :-
    foundword(X, insurance), !.
% This occurs if sentence contains
% a sequence in skips database and sentence also contains findings.
skipsentence([X|Rest], Semlist, Error) :-
    skips([X|Sseq]),            % X is the beg. of subseq. in skip database
    prefix([X|Rest], [X|Sseq]), % sentence contains subseq.
    (subtype(_, Semlist),       % sentence contains information to be extracted
     Error = no ;               % don't try to segment
     Error = yes), !.           % treat sentence as error and try to segment.

skipsentence([_|Rest], Semlist, Error) :-
    skipsentence(Rest, Semlist, Error).

% findingseg(+S, -Fseg, -Begseg): partitions sentence
% S is the sentence; Begseg is the segment preceding the
% modifiers of the finding; Fseg is the segment of S starting
% with the leftmost modifier of the finding and consists of the
% remaining sentence.
findingseg(S, Fseg, Begseg) :-
    partition(S, Begpart, Restpart),
    (Begpart = [], Begseg = [] ;
     Restpart = [], Fseg = [], Begseg = S ;
     rightlstmod(Begpart, Begseg, Modseg)),
    append(Modseg, Restpart, Fseg).
findingseg(_, [], _) :- !.
actionfindingseg(S, Fseg, Begseg) :-
    partition(S, Begpart, Restpart),
    (Begpart = [], Begseg = [] ;
     Restpart = [], Fseg = [], Begseg = S ;
     reverse(Begpart, ReversedBefore),
     findsubstance(ReversedBefore, Rest),
     append(Substancepart, Rest, ReversedBefore),
     reverse(Substancepart, Leftpart),
     reverse(Rest, Begseg),
     append(Leftpart, Restpart, Fseg)).
actionfindingseg(_, [], _) :- !.
findsubstance([], []) :- !.
findsubstance([X|Rest], Rest) :-
    substance(_, [X], []), !.
findsubstance([X|Rest1], Rest) :-
    findsubstance(Rest1, Rest).

% partition(+S, -Begpart, -Restpart): partitions sentence
% S is initial sentence; Begpart is part of sentence before the
% finding; Restpart is the rest of the sentence and starts with
% the finding. If there are 2 consecutive findings
% the 1st one is considered a modifier
partition([], [], []) :- !.

partition([X|Rest], [X|Begpart], Restpart) :-
    not(isfinding(X)), !, partition(Rest, Begpart, Restpart).

partition([X, Y|Rest], [X], [Y|Rest]) :-
    isfinding(X), isfinding(Y), !.

partition([X|Rest], [], [X|Rest]) :-
    isfinding(X), !.
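
% Illustrative sketch (hypothetical predicates and vocabulary): partition/3
% splits a token list at the first finding. partition_demo/3 repeats the
% four clauses with a stub lexicon, demo_finding/1, standing in for
% isfinding/1, whose real definition depends on foundword/2 and subtype/2.
demo_finding(activate).
demo_finding(express).
partition_demo([], [], []) :- !.
partition_demo([X|Rest], [X|Begpart], Restpart) :-
    \+ demo_finding(X), !, partition_demo(Rest, Begpart, Restpart).
partition_demo([X, Y|Rest], [X], [Y|Rest]) :-
    demo_finding(X), demo_finding(Y), !.   % first of two findings acts as modifier
partition_demo([X|Rest], [], [X|Rest]) :-
    demo_finding(X), !.
% ?- partition_demo([the, protein, activate, gene], B, R).
%    B = [the, protein], R = [activate, gene].
% ?- partition_demo([express, activate, gene], B, R).
%    B = [express], R = [activate, gene].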

% isfinding(+X): is true if X is a word or phrase whose semantic class
% is a finding or subtype of finding.
isfinding(X) :-
    foundword(X, Sem), % semantic class of word
    subtype(_, [Sem]). % is class a type of finding, recommend, or technique

% semantic class which are types of relevant information
subtype(finding, Sem) :-
    intersect(Sem, [attach, createbond, breakbond, activate,
                    inactivate, substitute, transcribe, express, promote,
                    signal]).

% there is only one type of technique class
subtype(technique, Sem) :-
    member(technique, Sem).
subtype(time, Sem) :-
    intersect(Sem, [status, sstatus, change, tmper, vstatus]).
findinginlist(Sem) :-
    intersect(Sem, [attach, createbond, breakbond, activate,
                    inactivate, substitute, transcribe, express, promote,
                    signal]).

% chkforpunct(+W, +Rest): is true if there should be no space after word W
chkf orpunct (W, _) : - member (W, [ ' / ■ , ' < ' , 1 > 1 , ' - ' , 1 " * , ' [ ' , * 1 * , 

1 { 1 , 1 } ' , *_' , ' + ' , ' = ' t ' I ' , , \' ] ) < 1 • 
% nothing left to write,
chkforpunct(W, []) :- !.

% is true if there should be no space before word after current word
chkforpunct(_, [W|_]) :-
    ispunct(W).

% ispunct(+W) is true if W is a punctuation for sentence print out
% The following characters are not treated as punct: ~ " # $ ^ & *

ispunct (W) :- member (W, [ \ ' ,'.',';','/' , ' ' ' 1 , 1 ^ 1 = 1 < ' ' ' [ ' ' ' 3 ' ' 

•{','}','(',')', , _',' + , / t '\', '%'/ '©*]) * 

% rightlstmod(List, Firstpart, Modpart): Modpart begins with the first
% word in List which is a modifier; Firstpart are the preceding words
rightlstmod([], [], []) :- !.

% X is a modifier or finding; Beginning part is empty
rightlstmod([X|Rest], [], [X|Rest]) :-
    foundword(X, Sem, Target),
    (modifier(Sem) ; Sem = p, Target = [_, conn] ; subtype(_, [Sem])), !.
% X is not a modifier or finding
rightlstmod([X|Rest], [X|Firstpart], Modpart) :-
    rightlstmod(Rest, Firstpart, Modpart).

% frame(Frame, Type, Value, Mods): creates a list Frame, whose 1st
% element is Type, 2nd element is Value, and 3rd is a list of
% modifier frames or is empty

% Case where modifier list is empty; Value should be atom except for
% certain types;
frame([Type, Value], Type, Value, X) :-
    (X = [] ; X = [[]]),
    atom(Value), !.
% Special cases where value of type should be a list
frame([Type, [H|R]], Type, [H|R], X) :-
    (X = [] ; X = [[]]),
    oklist(Type), !.

% Modifier list is merged with list consisting of Type and Value
frame(Frame, Type, Value, Mods) :-
    atom(Value),
    append([Type, Value], Mods, Frame), !.

frame(Frame, Type, [H|R], Mods) :-
    is_list(R),
    append(R, Mods, NewMods),
    append([Type, H], NewMods, Frame), !.
% Components of Frame
frame([Type, Value|Mods], Type, Value, Mods) :- !.

% Value of Type should not be a list; first element of value is real value
frame([Type, H, Rest], Type, [H|Rest], []) :- !.

% Special cases where value of type should be a list
%frame([Type, [H|R]], Type, [H|R], []) :- %repeated from rule above
%    oklist(Type), !.

% Value of Type should not be a list; first element of value is real value
frame(Frame, Type, [H|Rest], Mods) :-
    mergemods(Rest, Mods, NewMods),
    append([Type, H], NewMods, Frame).

% mergemodinf(-F, +Frame, +Mods): Frame is a type-value-mod frame; Mods
% is an additional set of modifiers for Frame; mergemodinf adds Mods
% to Frame, resulting in F.
mergemodinf([], [], _) :- !.
mergemodinf(F, [rel, X|Rest], Modrel) :-
    mergemodinf(F1, Rest, Modrel),
    append([rel, X], F1, F), !.
mergemodinf(F, [F1, X|Modfin], Modrel) :-
    atom(F1), mergemods(Modrel, Modfin, Mod),
    append([F1, X], Mod, F), !.
mergemodinf(F, [H|R], Modrel) :-
    mergemodinf(F1, H, Modrel),
    mergemodinf(F2, R, Modrel),
    append([F1], F2, F).

% addmodstof(+Args, +Mods, -NewArgs) is true if Args is a list of formats,
% Mods is a list of modifiers and NewArgs is a list of formats where Mods
% has been added to modifier list of that format
addmodstof([], _, []) :- !. % no more formats
addmodstof([Format1|Rest], Mods, [F1|NewRest]) :-
    mergemodinf(F1, Format1, Mods),     % merge modifiers into 1st format
    addmodstof(Rest, Mods, NewRest), !. % add modifier to remaining

% oklist(+Type): is true if Type can have a list as its value
oklist(unitval).
oklist(age).
oklist(measure).
oklist(prev_timeunit).
oklist(future_exam).

% mergemods(+Mods1, +Mods2, -Mod): Mods1 and Mods2 are a list of modifier lists
% Mod is the merged list; some elements of Mods1 and Mods2 may be
% empty
mergemods([], M, M) :- !.
mergemods(M, [], M).
mergemods(Mods1, Mods2, Mod) :-
    delete(Mods1, [], M1),
    delete(Mods2, [], M2),
    append(M1, M2, Mod).
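
% Illustrative sketch (hypothetical names, not from the original source):
% the general clause of mergemods/3 drops empty modifier lists from both
% arguments and concatenates what is left. mergemods_demo/3 restates that
% step with exclude/3 from SWI-Prolog's library(apply), assuming an
% SWI-Prolog environment. It removes only elements identical to [],
% whereas the matching rule used by delete/3 may vary between Prolog
% systems.
:- use_module(library(apply)).
:- use_module(library(lists)).
mergemods_demo(Mods1, Mods2, Mod) :-
    exclude(==([]), Mods1, M1),   % drop empty modifier lists on the left
    exclude(==([]), Mods2, M2),   % and on the right
    append(M1, M2, Mod).
% ?- mergemods_demo([[certainty, no], []], [[], [degree, low]], M).
%    M = [[certainty, no], [degree, low]].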

% addmod(+Mod, +Modlist, -NewMod): NewMod is formed by including
% Mod into Modlist
addmod([], Mod, Mod) :- !.
addmod(Mod, [], [Mod]) :- !.
addmod(Mod, Modlist, NewMod) :-
    append([Mod], Modlist, NewMod).

% modlist(+ListofMods, -Mods): ListofMods is a list consisting of
% individual modifier frames, some of which may be empty
% Mods is formed as a list of non-empty modifiers
modlist([], []) :- !.

% ignore a modifier which is an empty list
modlist([[]|R], Mods) :-
    modlist(R, Mods), !.
modlist([[H|R1]|R2], Mods) :-
    atom(H), !,
    modlist(R2, Rmods),
    addmod([H|R1], Rmods, Mods).
modlist([[H|R1]|R2], Mods) :-
    is_list(H), !, % if first element is a list
    modlist(R2, Rmods),
    mergemods([H|R1], Rmods, Mods).

%bpframe: creates frame for sequences of bodyloc/region/position
bpframe(F, [], _, F, []) :- !. % only 1 bodyloc

bpframe(F, [], Type, Bp1, Bp2) :- % no conj relation but more than 1 bodyloc
    frame(Bp1, Bp1Type, Bp1Val, Bp1Mods), % contents of Bp1 frame
    frame(Bp2, Bp2Type, Bp2Val, Bp2Mods), % contents of Bp2 frame
    ((Bp1Type = region ; Bp1Type = position),
     Bp2Type = bodyloc,                          % 'left lung', 'area of lung'
     mergemods(Bp1Mods, Bp2Mods, BpMods),        % new region modifier
     frame(NewBp2Mods, Bp1Type, Bp1Val, BpMods), % new Bp1 frame w new mod
     frame(F, Bp2Type, Bp2Val, [NewBp2Mods])     % main frame is bodyloc
     ;
     Bp1Type = bodyloc, Bp2Type = bodyloc, Type = main, % Bp2 is main
     mergemods(Bp1Mods, Bp2Mods, BpMods),        % new bodyloc modifier
     frame(NewBp2Mods, Bp1Type, Bp1Val, BpMods), % 'joint of shoulder'
     frame(F, Bp2Type, Bp2Val, [NewBp2Mods])     % main bp frame is shoulder
     ;
     mergemods(Bp1Mods, Bp2Mods, BpMods),
     frame(NewBp1Mods, Bp2Type, Bp2Val, BpMods), % 'shoulder joint'
     frame(F, Bp1Type, Bp1Val, [NewBp1Mods])     % main bp frame is shoulder
    ).

bpframe(F, Rel, _, Bp1, Bp2) :- % no conj relation but more than 1 bodyloc
    Rel = [rel, Conj|_], Bp2 \= [],
    mergemods([Bp1], [Bp2], Conjargs),
    frame(F, rel, Conj, Conjargs).

getrelation(R, F1, F2, F) :-
    (F2 \= [],
     (F1 = [rel, Conj1|Rest1], R = [rel, Conj],
      (Conj1 = ',' ; Conj1 = or ; Conj1 = and),
      (Conj = ',' ; Conj = or ; Conj = and) ;
      Rest1 = [F1]),
     (F2 = [rel, Conj2|Rest2],
      (Conj2 = ',' ; Conj2 = or ; Conj2 = and) ;
      Rest2 = [F2]),
     %splice([R, Rest1, Rest2], F) ;
     append([R, Rest1, Rest2], F) ;
     F2 = [], F = F1).



uptotal :-
    addstotal(X),
    X =< 50,
    NewX is X + 1,
    retractall(addstotal(X)),
    assert(addstotal(NewX)),
$save{'a'} = 'AAAC';
$save{'b'} = 'AAAG';
$save{'c'} = 'AAAT';
$save{'d'} = 'AACC';
$save{'e'} = 'AACG';
$save{'f'} = 'AACT';
$save{'g'} = 'AAGC';
$save{'h'} = 'AAGG';
$save{'i'} = 'AAGT';
$save{'j'} = 'AATC';
$save{'k'} = 'AATG';
$save{'l'} = 'AATT';
$save{'m'} = 'ACAC';
$save{'n'} = 'ACAG';
$save{'o'} = 'ACAT';
$save{'p'} = 'ACCC';
$save{'q'} = 'ACCG';
$save{'r'} = 'ACCT';
$save{'s'} = 'ACGC';
$save{'t'} = 'ACGG';
$save{'u'} = 'ACGT';
$save{'v'} = 'ACTC';
$save{'w'} = 'ACTG';
$save{'x'} = 'ACTT';
$save{'y'} = 'AGAG';
$save{'z'} = 'AGAT';
$save{'0'} = 'AGCC';
$save{'1'} = 'AGCG';
$save{'2'} = 'AGCT';
$save{'3'} = 'AGGC';
$save{'4'} = 'AGGG';
$save{'5'} = 'AGGT';
$save{'6'} = 'AGTC';
$save{'7'} = 'AGTG';
$save{'8'} = 'AGTT';
$save{'9'} = 'ATAT';


$save{ 




j = 


1 ATCC » ; 


$save{']'} = 'ATCC';
$save{'['} = 'ATCC';


$save{ 


} 


] = 


1 ATCC ' ; 


$save{':'} = 'ATCC';


$save{ 


1 tt t 


} = 


' ATCC ■ ; 


$save{ 


■v 




= ' ATTC ' 


$save{ 


1 •? i 




' ATCC ' ; 


$save{ 


1 r ' 




' ATCC * ; 


$save{'#'} = 'CCCG';


$save{ 






1 CCCT r ; 


$save{ 


[ A i 




1 CCGG ' ; 


$save{'&'} = 'CCGT';
$save{'*'} = 'CCTG';
$save{'('} = 'ATCC';
$save{')'} = 'ATCC';
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S 3V6 | 


i i 






SclVG | 








save { 


i _t_ i 




L-vj^j 1 


save { 


f „ r 






save { 


} J 




CGI 1 


save { 


{ . 




LlLl 


save { 






'AILL 


save { 






AILL 


save { 


' 1 ' 






save { 


I 9- i 1 
o 




Li i 1 


save { 


/ J 




I ATPP ' 
r\ ± \_<>_ 


save | 


' \ \ 


\. 


_ i ggtT 


save{ 






i GTGT 1 


save{ 


»\n* 


'}= 


= ' ATCC 


save { 


■ < ' " 




, GTTT , 


save { 


' > ' ] 




, GTTT . 


save { 
#!/usr/bin/perl
#Scan.pl: Scans blast output
#Author: Michael Krauthammer
#Copyright: c.1999, Columbia University

#Variables
#blast input/file
$input_file="genebank.result";
#program output
$output_file="match.txt";

#open datastream for file which contains blast output
open(INPUT, '/storage/psi-blast/Marklt/programs/marklt.result');

while ($line=<INPUT>) {
    #header line of a hit: gi number, semantic class and target term
    if ($line=~/\>gi\|(\d*)(.*)\,(.*)\,(.*)/) {
        $target=$4;
        $gi=$1;
        $semantic_class=$3;
    }

    if ($line=~/Length = (.*)/) {
        $length1=$1;
    }

    if ($line=~/Identities \= (\d*)\//) {
        $length_actual=$1;
    }

    if ($line=~/Query: (\d*)/) {
        $start=$1;
    }

    #print if Subj 1, sometimes match 2 or 3 line long
    if ($line=~/Sbjct: 1 /) {
        #keep only hits where more than 90% of the database entry is identical
        if (($length_actual/$length1) > .9) {
            print $target, "|", $start, "|", $start+$length1, "|", $semantic_class, "|", $gi, "\n";
        }
    }
}
#!/usr/bin/perl
#nucleotide_text_parser.pl
#Author: Michael Krauthammer, c.1999 Columbia University

open(INPUT, $ARGV[0]);

#read uncoded input text line by line (chop it)
$all='';

while ($line=<INPUT>) {
    $all=$all.$line;
}

open(INPUTII, '/storage/psi-blast/Marklt/programs/markltll.result');
open(OUTPUT, '>result.txt');

#first part: check matches, store positions
while ($line=<INPUTII>) {
    ($name, $start, $end, $semantic_class, $gi)=$line=~/(.*)\|(.*)\|(.*)\|(.*)\|(.*)/;

    #divide by 4 (4 letter code)
    $start=($start-1)/4;
    $end=($end-1)/4;

    #get substring
    if ($start != 0) {
        $letters=substr($all, $start-1, $end-$start+3)."|";
    } else {
        $letters=' '.substr($all, 0, $end+2)."|";
    }

    ($letter_beginning)=$letters=~/(^.)/;
    $letter_end=substr($all, $end, 1);
    $letter_endII=substr($all, $end, 2);

    #ignore matches that are in the MIDDLE of sentences, allow plurals
    $letter_beginning=~tr/[A-Z]/[a-z]/;
    $letter_end=~tr/[A-Z]/[a-z]/;

    if ((!($letter_beginning=~/[a-z]/)) && ((!($letter_end=~/[a-z]/)) ||
        ($letter_endII=~/s /))) {

        #make sure only the first occurence is stored at this position
        if ($save{$start}=='') {
            $save{$start}=$end.'|'.$semantic_class.'|'.$gi;
        }

        #drop this match if an earlier match starts before it and ends after it
        foreach $key (keys(%save)) {
            ($end_key)=$save{$key}=~/^(.*)\|/;
            if ($end_key>$end) {
                if ($key<$start) {
                    $save{$start}='null';
                }
            }
        }
    }
}
#second part: print out marked up document
sort(%save);

for ($i=0; $i<length($all); $i++) {
    #a stored, non-null match starts at this position: wrap it in a phr tag
    if (($save{$i} ne 'null') && ($save{$i}=~/./)) {
        ($end, $semantic_class)=$save{$i}=~/(.*)\|(.*)\|/;
        print OUTPUT '<phr="', $semantic_class, '">';
        $store=substr($all, $i, $end-$i);
        print OUTPUT $store;
        print OUTPUT "</phr>";
        $i=$end-1;
    } else {
        $store=substr($all, $i, 1);
        print OUTPUT $store;
    }
}